Beginning of the End … Or the End of the Beginning?

The past few years have been challenging for Good Tunes & More (GT&M), a business that traces its roots to Good Tunes, a store that exclusively sold music CDs and vinyl records. GT&M first broadened its merchandise to include home entertainment and computer systems (the “More”), and then undertook an expansion to take advantage of prime locations left empty by bankrupt former competitors. Today, GT&M finds itself at a crossroads. Hoped-for increases in revenues that have failed to occur and declining profit margins due to the competitive pressures of online sellers have led management to reconsider the future of the business.

While some investors in the business have argued for an orderly retreat, closing stores and limiting the variety of merchandise, GT&M CEO Emma Levia has decided to “double down” and expand the business by purchasing Whitney Wireless, a successful three-store chain that sells smartphones and other mobile devices. Levia foresees creating a brand new “A-to-Z” electronics retailer but first must establish a fair and reasonable price for the privately held Whitney Wireless.

To do so, she has asked a group of analysts to identify the data that would be helpful in setting a price for the wireless business. As part of that group, you quickly realize that you need the data that would help to verify the contents of the wireless company’s basic financial statements. You focus on data associated with the company’s profit and loss statement and quickly realize the need for sales and expense-related variables. You begin to think about what the data for such variables would look like and how to collect those data. You realize that you are starting to apply the DCOVA framework to the objective of helping Levia acquire Whitney Wireless.

Chapter 1 Defining and Collecting Data


contents

1.1 Defining Variables

1.2 Collecting Data

1.3 Types of Sampling Methods

1.4 Types of Survey Errors

Think About This: New Media

Surveys/Old Sampling Problems

Using Statistics: Beginning of

the End … Revisited

Chapter 1 Excel Guide

Chapter 1 Minitab Guide

Objectives

Understand issues that arise

when defining variables

How to define variables

How to collect data

Identify the different ways to

collect a sample

Understand the types of

survey errors

Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.

Copyright © 2016 by Pearson Education, Inc.

ISBN: 978-1-323-26258-0

1.1 Defining Variables

When Emma Levia decides to purchase Whitney Wireless, she has defined a new

goal or business objective for GT&M. Business objectives can arise from any

level of management and can be as varied as the following:

• A marketing analyst needs to assess the effectiveness of a new online advertising campaign.

• A pharmaceutical company needs to determine whether a new drug is more effective

than those currently in use.

• An operations manager wants to improve a manufacturing or service process.

• An auditor needs to review a company’s financial transactions to determine whether the

company is in compliance with generally accepted accounting principles.

Establishing an objective marks the end of a problem definition process. This end triggers

the new process of identifying the correct data to support the objective. In the GT&M scenario,

having decided to buy Whitney Wireless, Levia needs to identify the data that would be helpful

in setting a price for the wireless business. This process of identifying the correct data triggers

the start of applying the tasks of the DCOVA framework. In other words, the end of problem

definition marks the beginning of applying statistics to business decision making.

Identifying the correct data to support a business objective is a two-part job that requires

defining variables and collecting the data for those variables. These tasks are the first two tasks

of the DCOVA framework first defined in Section GS.1 and which can be restated here as:

• Define the variables that you want to study to solve a problem or meet an objective.

• Collect the data for those variables from appropriate sources.

This chapter discusses these two tasks, which must always be done before the Organize, Visualize, and Analyze tasks.

Defining variables at first may seem to be the simple process of making the list of things one

needs to help solve a problem or meet an objective. However, consider the GT&M scenario.

Most would quickly agree that yearly sales of Whitney Wireless would be part of the data

needed to meet Levia’s objective, but just placing “yearly sales” on a list could lead to confusion

and miscommunication: Does this variable refer to sales per year for the entire chain or

for individual stores? Does the variable refer to net or gross sales? Are the yearly sales values

expressed in number of units or as currency amounts such as U.S. dollar sales?

These questions illustrate that for each variable of interest that you identify you must supply

an operational definition, a universally accepted meaning that is clear to all associated

with an analysis. Operational definitions should also classify the variable, as explained in the

next section, and may include additional facts such as units of measure, allowed range of

values, and definitions of specific variable values, depending on how the variable is classified.

Classifying Variables by Type

When you operationally define a variable, you must classify the variable as being either categorical

or numerical. Categorical variables (also known as qualitative variables) take categories

as their values. Numerical variables (also known as quantitative variables) have values

that represent a counted or measured quantity. Classification also affects a variable’s operational

definition and getting the classification correct is important because certain statistical methods

can be applied correctly to one type or the other, while other methods may need a specific mix

of variable types.

Categorical variables can take the form of yes-and-no questions such as “Do you have a Twitter account?” (in which yes and no form the variable’s two categories) or describe a trait or characteristic that has many categories such as undergraduate class standing (which might have the defined categories freshman, sophomore, junior, and senior). When defining a categorical variable, the list of permissible category values must be included and each category value should be defined, too, e.g., that a “freshman” is a student who has completed fewer than 32 credit hours. Overlooking these requirements can lead to confusion and incorrect data collection. In one famous example, when persons were asked by researchers to fill in a value for the categorical variable sex, many answered yes and not male or female, the values that the researchers intended. (Perhaps this is the reason that gender has replaced sex on many data collection forms: gender’s operational definition is more self-apparent.)

Student Tip

Providing operational definitions for concepts is important, too, when writing a textbook! The end-of-chapter Key Terms gives you an index of operational definitions and the most fundamental definitions are presented in boxes such as the page 3 box that defines variable and data.

The operational definitions of numerical variables are affected by whether the variable being

defined is discrete or continuous. Discrete variables such as “number of items purchased”

or “total amount paid” are numerical values that arise from a counting process. Continuous

variables such as “time spent on checkout line” or “distance from home to store” have numerical

values that arise from a measuring process and those values depend on the precision of the

measuring instrument used. For example, “time spent on checkout line” might be 2, 2.1, 2.14,

or 2.143 minutes, depending on the precision of the timing instrument being used. Units of

measure and the level of precision should be part of the operational definitions of continuous

variables, e.g., “tenths of a second” for “time spent on checkout line.” The definitions of any

numerical variable can include the allowed range of values, such as “must be greater than 0”

for “number of items purchased.”
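The elements of an operational definition described above (classification, permissible categories, units, and allowed range) can be sketched in code. The following Python example is only an illustration: the class name VariableDefinition, its fields, and the three sample definitions are hypothetical and are not part of the DCOVA framework or of Excel or Minitab.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VariableDefinition:
    """Hypothetical container for one variable's operational definition."""
    name: str
    kind: str                              # "categorical", "discrete", or "continuous"
    description: str
    categories: Tuple[str, ...] = ()       # permissible values (categorical only)
    units: Optional[str] = None            # unit of measure (numerical only)
    minimum: Optional[float] = None        # smallest allowed value, if any
    maximum: Optional[float] = None        # largest allowed value, if any

    def is_valid(self, value) -> bool:
        """Return True if the value is permitted by this definition."""
        if self.kind == "categorical":
            return value in self.categories
        # Numerical variables: reject non-numbers and non-finite values.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isfinite(value):
            return False
        # Discrete values arise from counting, so they must be whole numbers.
        if self.kind == "discrete" and not float(value).is_integer():
            return False
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


class_standing = VariableDefinition(
    name="class_standing", kind="categorical",
    description="Undergraduate class standing",
    categories=("freshman", "sophomore", "junior", "senior"))

# "Must be greater than 0" is expressed as a minimum of 1 for a whole-number count.
items_purchased = VariableDefinition(
    name="items_purchased", kind="discrete",
    description="Number of items purchased", units="items", minimum=1)

checkout_time = VariableDefinition(
    name="checkout_time", kind="continuous",
    description="Time spent on checkout line",
    units="minutes, recorded to tenths", minimum=0)

print(class_standing.is_valid("junior"))   # a defined category
print(class_standing.is_valid("yes"))      # not a defined category
print(items_purchased.is_valid(0))         # outside the allowed range
```

Writing the definition down in one place, whatever the tool, makes it easier for everyone associated with an analysis to apply the same meaning to each variable.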

When defining variables for survey collection (discussed in Section 1.2), thinking about

the responses you seek helps classify variables as Table 1.1 demonstrates. Thinking about how

a variable will be used to solve a problem or meet an objective can also be helpful when you

define a variable. The variable age might be a numerical (discrete) variable in some cases or

might be categorical with categories such as child, young adult, middle-aged, and retirement

aged in other contexts.


Problems for Section 1.1

Learning the Basics

1.1 Four different beverages are sold at a fast-food restaurant:

soft drinks, tea, coffee, and bottled water. Explain why the

type of beverage sold is an example of a categorical variable.

1.2 U.S. businesses are listed by size: small, medium, and large. Explain

why business size is an example of a categorical variable.

1.3 The time it takes to download a video from the Internet is

measured. Explain why the download time is a continuous

numerical variable.

Applying the Concepts

SELF

Test

1.4 For each of the following variables, determine

whether the variable is categorical or numerical. If the

variable is numerical, determine whether the variable is discrete or

continuous.

a. Number of cellphones in the household

b. Monthly data usage (in MB)

c. Number of text messages exchanged per month

d. Voice usage per month (in minutes)

e. Whether the cellphone is used for email

1.5 The following information is collected from students upon

exiting the campus bookstore during the first week of classes.

a. Amount of time spent shopping in the bookstore

b. Number of textbooks purchased

c. Academic major

d. Gender

Classify each of these variables as categorical or numerical. If the

variable is numerical, determine whether the variable is discrete or

continuous.

1.6 For each of the following variables, determine whether the

variable is categorical or numerical. If the variable is numerical,

determine whether the variable is discrete or continuous.

a. Name of Internet service provider

b. Time, in hours, spent surfing the Internet per week

c. Whether the individual uses a mobile phone to connect to the

Internet

d. Number of online purchases made in a month

e. Where the individual uses social networks to find sought-after

information

Learn More

Read the Short Takes for

Chapter 1 for more examples

of classifying variables

as either

categorical or numerical.

Table 1.1 Identifying Types of Variables

Question                                                        Responses         Variable Type
Do you have a Facebook profile?                                 ❑ Yes ❑ No        Categorical
How many text messages have you sent in the past three days?    ______            Numerical (discrete)
How long did the mobile app update take to download?            ______ seconds    Numerical (continuous)


1.7 For each of the following variables, determine whether the

variable is categorical or numerical. If the variable is numerical,

determine whether the variable is discrete or continuous.

a. Amount of money spent on clothing in the past month

b. Favorite department store

c. Most likely time period during which shopping for clothing

takes place (weekday, weeknight, or weekend)

d. Number of pairs of shoes owned

1.8 Suppose the following information is collected from Robert

Keeler on his application for a home mortgage loan at the Metro

County Savings and Loan Association.

a. Monthly payments: $2,227

b. Number of jobs in past 10 years: 1

c. Annual family income: $96,000

d. Marital status: Married

Classify each of the responses by type of data.

1.9 One of the variables most often included in surveys is income.

Sometimes the question is phrased “What is your income

(in thousands of dollars)?” In other surveys, the respondent is

asked to “Select the circle corresponding to your income level”

and is given a number of income ranges to choose from.

a. In the first format, explain why income might be considered

either discrete or continuous.

b. Which of these two formats would you prefer to use if you

were conducting a survey? Why?

1.10 If two students score a 90 on the same examination,

what arguments could be used to show that the underlying

variable—test score—is continuous?

1.11 The director of market research at a large department store

chain wanted to conduct a survey throughout a metropolitan area

to determine the amount of time working women spend shopping

for clothing in a typical month.

a. Indicate the type of data the director might want to collect.

b. Develop a first draft of the questionnaire needed in (a) by writing

three categorical questions and three numerical questions

that you feel would be appropriate for this survey.


1.2 Collecting Data

After defining the variables that you want to study, you can proceed with the data collection

task. Collecting data is a critical task because if you collect data that are flawed by biases,

ambiguities, or other types of errors, the results you will get from using such data with even

the most sophisticated statistical methods will be suspect or in error. (For a famous example of

flawed data collection leading to incorrect results, read the Think About This essay on page 21.)

Data collection consists of identifying data sources, deciding whether the data you collect

will be from a population or a sample, cleaning your data, and sometimes recoding variables.

The rest of this section explains these aspects of data collection.

Data Sources

You collect data from either primary or secondary data sources. You are using a primary data

source if you collect your own data for analysis. You are using a secondary data source if the

data for your analysis have been collected by someone else.

You collect data by using any of the following:

• Data distributed by an organization or individual

• The outcomes of a designed experiment

• The responses from a survey

• The results of conducting an observational study

• Data collected by ongoing business activities

Market research companies and trade associations distribute data pertaining to specific industries

or markets. Investment services provide business and financial data on publicly listed

companies. Syndicated services such as The Nielsen Company provide consumer research data to

telecom and mobile media companies. Print and online media companies also distribute data that

they may have collected themselves or may be republishing from other sources.

The outcomes of a designed experiment are a second data source. For example, a consumer

electronics company might conduct an experiment that compares the sales of mobile

electronics merchandise for different store locations. Note that developing a proper experimental

design is mostly beyond the scope of this book, but Chapter 10 discusses some of the

fundamental experimental design concepts.

Survey responses represent a third type of data source. People being surveyed are asked

questions about their beliefs, attitudes, behaviors, and other characteristics. For example,

people could be asked which store location for mobile electronics merchandise is preferable.

(Such a survey could lead to data that differ from the data collected from the outcomes of the


designed experiment of the previous paragraph.) Surveys can be affected by any of the four

types of errors that are discussed in Section 1.4.

Observational study results are a fourth data source. A researcher collects data by directly

observing a behavior, usually in a natural or neutral setting. Observational studies are a common

tool for data collection in business. For example, market researchers use focus groups

to elicit unstructured responses to open-ended questions posed by a moderator to a target audience.

Observational studies are also commonly used to enhance teamwork or improve the

quality of products and services.

Data collected by ongoing business activities are a fifth data source. Such data can be

collected from operational and transactional systems that exist in both physical “bricks-and-mortar”

and online settings but can also be gathered from secondary sources such as third-party

social media networks and online apps and website services that collect tracking and usage data.

For example, a bank might analyze a decade’s worth of financial transaction data to identify

patterns of fraud, and a marketer might use tracking data to determine the effectiveness of a

website.

Sources for big data (see Section GS.3) tend to be a mix of primary and secondary sources of this last type. For example, a retailer interested in increasing sales might mine Facebook and Twitter accounts to identify sentiment about certain products or to pinpoint top influencers and then match those data to its own data collected during customer transactions.

Populations and Samples

You collect your data from either a population or a sample. A population consists of all the

items or individuals about which you want to reach conclusions. All the GT&M sales transactions

for a specific year, all the full-time students enrolled in a college, and all the registered

voters in Ohio are examples of populations. In Chapter 3, you will learn that when you analyze

data from a population you compute parameters.

A sample is a portion of a population selected for analysis. The results of analyzing a

sample are used to estimate characteristics of the entire population. From the three examples

of populations just given, you could select a sample of 200 GT&M sales transactions randomly

selected by an auditor for study, a sample of 50 full-time students selected for a marketing

study, and a sample of 500 registered voters in Ohio contacted via telephone for a political

poll. In each of these examples, the transactions or people in the sample represent a portion of

the items or individuals that make up the population. In Chapter 3, you will learn that when

you analyze data from a sample you compute statistics.

You collect data from a sample when any of the following applies:

• Selecting a sample is less time consuming than selecting every item in the population.

• Selecting a sample is less costly than selecting every item in the population.

• Analyzing a sample is less cumbersome and more practical than analyzing the entire

population.
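The distinction between a population and a sample can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the 5,000 transaction amounts are randomly generated rather than real GT&M data, and the mean is used only as a convenient example of a characteristic that can be computed for both a population (a parameter) and a sample (a statistic).

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 5,000 made-up sales transaction amounts in dollars.
population = [round(rng.uniform(5, 500), 2) for _ in range(5000)]

# Computed from every item in the population, the mean is a parameter.
population_mean = statistics.mean(population)

# Select 200 transactions at random, without replacement, as a sample.
sample = rng.sample(population, 200)

# Computed from the sample only, the mean is a statistic that
# estimates the population parameter.
sample_mean = statistics.mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

The sample mean will usually differ somewhat from the population mean, which is why results from a sample are described as estimates of population characteristics.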

Structured Versus Unstructured Data

The data you collect may be formatted in a variety of ways, some of which add to the data

collection task. For example, suppose that you wanted to collect electronic financial data

about a sample of companies. That data might exist as tables of data, the contents of standardized

documents such as fill-in-the-blank surveys, a continuous stream of data such as a

stock ticker, or text messages or emails delivered from email systems or social media websites.

Some of these forms, such as a set of text messages, have very little or no repeating structure and are examples of unstructured data. Although unstructured data can form part of a big data collection, collecting data in unstructured forms for the statistical methods discussed in this book requires conversion of the data to a structured form. For example, after collecting text messages, you could convert their contents to a structured form by defining a set of variables that might include a numerical variable that counts the number of words in the message and various categorical variables that help classify the content of the message.
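The conversion of text messages to a structured form might look like the following Python sketch. The function name structure_message, the two categorical variables, the keyword list, and the sample messages are all hypothetical choices made for illustration; a real analysis would define its own variables and operational definitions.

```python
def structure_message(text):
    """Convert one unstructured text message into a row of defined variables."""
    lowered = text.lower()
    price_keywords = ("price", "cost", "$")  # assumed keywords, for illustration only
    return {
        # Numerical (discrete): number of whitespace-separated words.
        "word_count": len(text.split()),
        # Categorical (yes/no): does the message mention price?
        "mentions_price": "yes" if any(k in lowered for k in price_keywords) else "no",
        # Categorical (yes/no): is the message phrased as a question?
        "is_question": "yes" if text.strip().endswith("?") else "no",
    }


# Made-up messages standing in for collected unstructured data.
messages = [
    "How much does the new phone cost?",
    "Store opens at 9",
    "Love the price on these headphones!",
]

rows = [structure_message(m) for m in messages]
for row in rows:
    print(row)
```

Each message becomes one row with the same three variables, which is the kind of repeating structure that a worksheet-based analysis expects.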

Learn More

Read the Short Takes

for Chapter 1 for a further

discussion about data

sources.

Student Tip

To help remember the

difference between a

sample and a population,

think of a pie. The

entire pie represents the

population, and the pie

slice that you select is

the sample.


Electronic Formats and Encodings

The same form of data can exist in more than one electronic format, with some formats more

immediately usable than others. For example, a table of data might exist as a scanned image

or as data in a worksheet file. The worksheet data could be immediately used in a statistical

analysis, but the scanned image would need to be first converted to worksheet data using a

character-scanning program that can recognize numbers in an image.

Data can also be encoded in more than one way, as you may have learned in an information systems course. Different encodings may affect the recorded precision of values for continuous variables and lead to values that are less precise or values that convey a false sense of precision, such as a time measurement that gets encoded in ten-thousandths of a second when the original measurement was only in tenths of a second. This changed precision can violate the operational definition of a continuous variable and sometimes affect calculated results.
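The effect of encoding on precision can be seen in a small Python sketch. The value 2.1 and the helper to_defined_precision are assumptions chosen for illustration; the sketch relies only on Python's standard floating-point and decimal behavior, not on how Excel or Minitab store values.

```python
from decimal import Decimal

# A time measured to tenths of a second (the precision in the operational definition).
measured = 2.1

# Formatting with four decimal places adds digits that were never measured,
# conveying a false sense of precision.
padded = f"{measured:.4f}"

# A binary floating-point encoding stores only the nearest representable value,
# which is not exactly the decimal value that was measured.
stored_exactly = Decimal(measured)
as_measured = Decimal("2.1")


def to_defined_precision(value, places=1):
    """Round a value back to the precision named in the operational definition."""
    return round(value, places)


print(padded)
print(stored_exactly == as_measured)
print(to_defined_precision(0.1 + 0.2))
```

Rounding calculated results back to the defined precision is one simple way to keep reported values consistent with the original measurements.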

Data Cleaning

Whatever ways you choose to collect data, you may find irregularities in the values you collect

such as undefined or impossible values. For a categorical variable, an undefined value would

be a value that does not represent one of the categories defined for the variable. For a numerical

variable, an impossible value would be a value that falls outside a defined range of possible

values for the variable. For a numerical variable without a defined range of possible values,

you might also find outliers, values that seem excessively different from most of the rest of the

values. Such values may or may not be errors, but they demand a second review.

Values that are missing are another type of irregularity. A missing value is a value that was

not able to be collected (and therefore not available to be analyzed). For example, you would

record a nonresponse to a survey question as a missing value. You can represent missing values

in Minitab by using an asterisk value for a numerical variable or by using a blank value for a

categorical variable, and such values will be properly excluded from analysis. The more limited

Excel has no special values that represent a missing value. When using Excel, you must

find and then exclude missing values manually.

When you spot an irregularity in the data you have collected, you may have to “clean” the

data. Although a full discussion of data cleaning is beyond the scope of this book (see reference

8), you can learn more about the ways you can use Excel or Minitab for data cleaning in

the Short Takes for Chapter 1.
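The checks for undefined, impossible, and missing values described in this section can be sketched in Python. The function find_irregularities, the use of None to represent a missing value, the variable names, and the sample rows are all assumptions made for illustration; they do not reproduce how Excel or Minitab represent or handle such values.

```python
def find_irregularities(rows, allowed_categories, valid_range):
    """Return (row index, variable, problem) tuples for values needing review."""
    low, high = valid_range
    issues = []
    for index, row in enumerate(rows):
        standing = row.get("class_standing")
        items = row.get("items_purchased")

        # Categorical variable: missing, or not one of the defined categories.
        if standing is None or standing == "":
            issues.append((index, "class_standing", "missing value"))
        elif standing not in allowed_categories:
            issues.append((index, "class_standing", "undefined value"))

        # Numerical variable: missing, or outside the defined range.
        if items is None:
            issues.append((index, "items_purchased", "missing value"))
        elif not (low <= items <= high):
            issues.append((index, "items_purchased", "impossible value"))
    return issues


# Made-up survey rows; None stands for a nonresponse.
rows = [
    {"class_standing": "junior", "items_purchased": 3},
    {"class_standing": "yes", "items_purchased": 2},
    {"class_standing": "senior", "items_purchased": -1},
    {"class_standing": None, "items_purchased": None},
]

issues = find_irregularities(
    rows,
    allowed_categories={"freshman", "sophomore", "junior", "senior"},
    valid_range=(1, 100),  # assumed allowed range for this illustration
)
for issue in issues:
    print(issue)
```

The sketch only flags values for a second review. Deciding whether a flagged value is an error, and what to do about it, remains a judgment for the analyst.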

Recoding Variables

After you have collected data, you may discover that you need to reconsider the categories that

you have defined for a categorical variable or that you need to transform a numerical variable

into a categorical variable by assigning the individual numeric data values to one of several

groups. In either case, you can define a recoded variable that supplements or replaces the

original variable in your analysis.

For example, having already defined the variable undergraduate class standing with the categories freshman, sophomore, junior, and senior, you realize that you are more interested in investigating the differences between lowerclassmen (defined as freshman or sophomore) and upperclassmen (junior or senior). You can create a new variable UpperLower and assign the value Upper if a student is a junior or senior and assign the value Lower if the student is a freshman or sophomore.

When recoding variables, be sure that the category definitions cause each data value to

be placed in one and only one category, a property known as being mutually exclusive. Also

ensure that the set of categories you create for the new, recoded variable includes all the data

values being recoded, a property known as being collectively exhaustive. If you are recoding

a categorical variable, you can preserve one or more of the original categories, as long as your

recodings are both mutually exclusive and collectively exhaustive.

When recoding numerical variables, pay particular attention to the operational definitions of the categories you create for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories Under 12, 12–20, 21–34, 35–54, and 55 and Over are self-defining for age, the categories Child, Youth, Young Adult, Middle Aged, and Senior need their own operational definitions.
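The two recodings discussed in this section can be sketched in Python. The names UPPER_LOWER, AGE_GROUPS, recode_standing, and recode_age are hypothetical, and the sketch assumes that age is recorded in whole years. The check inside recode_age is one possible way to confirm that the categories are mutually exclusive and collectively exhaustive.

```python
# Recoding a categorical variable: class standing to UpperLower.
UPPER_LOWER = {
    "freshman": "Lower",
    "sophomore": "Lower",
    "junior": "Upper",
    "senior": "Upper",
}


def recode_standing(standing):
    """Return Upper or Lower for a defined class standing."""
    return UPPER_LOWER[standing]


# Recoding a numerical variable: age in whole years (an assumption) to age groups.
# Each entry is (label, lowest age, highest age); None means no upper limit.
AGE_GROUPS = [
    ("Under 12", 0, 11),
    ("12–20", 12, 20),
    ("21–34", 21, 34),
    ("35–54", 35, 54),
    ("55 and Over", 55, None),
]


def recode_age(age):
    """Return the one age group that contains the given age."""
    matches = [
        label
        for label, low, high in AGE_GROUPS
        if age >= low and (high is None or age <= high)
    ]
    # Exactly one match means the groups are mutually exclusive (never two)
    # and collectively exhaustive (never zero) for this value.
    if len(matches) != 1:
        raise ValueError(f"age {age} falls in {len(matches)} groups")
    return matches[0]


print(recode_standing("junior"))
print(recode_age(11))
print(recode_age(55))
```

Because the ranges share no boundary values and the last group has no upper limit, every whole-number age of 0 or more is placed in one and only one category.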

Student Tip

While encoding issues

go beyond the scope

of this book, the Short

Takes for Chapter 1

includes an experiment

that you can perform in

either Microsoft Excel

or Minitab that illustrates

how data encoding can

affect the precision of

values.

Data cleaning will not be

necessary when you use the

(previously cleaned) data for

the examples and problems

in this book.

Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.

Copyright © 2016 by Pearson Education, Inc.

ISBN: 978-1-323-26258-0

Problems for Section 1.2

Applying the Concepts

1.12 The Data and Story Library (DASL) is an online library of

data files and stories that illustrate the use of basic statistical methods.

Visit lib.stat.cmu.edu/index.php, click DASL, and explore a

data set of interest to you. Which of the five sources of data best

describes the sources of the data set you selected?

1.13 Visit the website of the Gallup organization at www.gallup.com.

Read today’s top story. What type of data source is the top

story based on?

1.14 Visit the website of the Pew Research organization at

www.pewresearch.org. Read today’s top story. What type of data

source is the top story based on?

1.15 Transportation engineers and planners want to address the

dynamic properties of travel behavior by describing in detail the

driving characteristics of drivers over the course of a month. What

type of data collection source do you think the transportation engineers

and planners should use?

1.16 Visit the opening page of the Statistics Portal “Statista” at

(statista.com). Examine the “CHART OF THE DAY” panel on

the page. What type of data source is the information presented

here based on?

When you collect data by selecting a sample, you begin by defining the frame. The frame is

a complete or partial listing of the items that make up the population from which the sample

will be selected. Inaccurate or biased results can occur if a frame excludes certain groups, or

portions of the population. Using different frames to collect data can lead to different, even opposite,

conclusions.

Using your frame, you select either a nonprobability sample or a probability sample. In

a nonprobability sample, you select the items or individuals without knowing their probabilities

of selection. In a probability sample, you select items based on known probabilities.

Whenever possible, you should use a probability sample as such a sample will allow you to

make inferences about the population being analyzed.

Nonprobability samples can have certain advantages, such as convenience, speed, and low

cost. Such samples are typically used to obtain informal approximations or as small-scale initial

or pilot analyses. However, because the theory of statistical inference depends on probability

sampling, nonprobability samples cannot be used for statistical inference and this more

than offsets those advantages in more formal analyses.

Figure 1.1 shows the subcategories of the two types of sampling. A nonprobability sample

can be either a convenience sample or a judgment sample. To collect a convenience sample,

you select items that are easy, inexpensive, or convenient to sample. For example, in a warehouse

of stacked items, selecting only the items located on the tops of each stack and within

easy reach would create a convenience sample. So, too, would be the responses to surveys that

the websites of many companies offer visitors. While such surveys can provide large amounts

of data quickly and inexpensively, the convenience samples selected from these responses will

consist of self-selected website visitors. (Read the Think About This essay on page 21 for a

related story.)

1.3 Types of Sampling Methods

Figure 1.1 Types of samples

Nonprobability samples: judgment sample, convenience sample

Probability samples: simple random sample, systematic sample, stratified sample, cluster sample

To collect a judgment sample, you collect the opinions of preselected experts in the subject

matter. Although the experts may be well informed, you cannot generalize their results to

the population.

The types of probability samples most commonly used include simple random, systematic,

stratified, and cluster samples. These four types of probability samples vary in terms of

cost, accuracy, and complexity, and they are the subject of the rest of this section.

Simple Random Sample

In a simple random sample, every item from a frame has the same chance of selection as every

other item, and every sample of a fixed size has the same chance of selection as every other

sample of that size. Simple random sampling is the most elementary random sampling technique.

It forms the basis for the other random sampling techniques. However, simple random

sampling has its disadvantages. Its results are often subject to more variation than other sampling

methods. In addition, when the frame used is very large, carrying out a simple random

sample may be time consuming and expensive.

With simple random sampling, you use n to represent the sample size and N to represent

the frame size. You number every item in the frame from 1 to N. The chance that you will select

any particular member of the frame on the first selection is 1/N.

You select samples with replacement or without replacement. Sampling with replacement

means that after you select an item, you return it to the frame, where it has the same

probability of being selected again. Imagine that you have a fishbowl containing business

cards, one card for each person. On the first selection, you select the card for Grace Kim. You

record pertinent information and replace the business card in the bowl. You then mix up the

cards in the bowl and select a second card. On the second selection, Grace Kim has the same

probability of being selected again, 1/N. You repeat this process until you have selected the

desired sample size, n.

Typically, you do not want the same item or individual to be selected again in a sample.

Sampling without replacement means that once you select an item, you cannot select

it again. The chance that you will select any particular item in the frame (for example, the

business card for Grace Kim) on the first selection is 1/N. The chance that you will select any

card not previously chosen on the second selection is now 1 out of N – 1. This process continues

until you have selected the desired sample of size n.
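The difference between the two approaches can be sketched with Python's standard random module. The frame of ten placeholder names, the sample size of 4, and the seed are assumptions chosen only to keep the illustration small and repeatable.

```python
import random

# Hypothetical frame of N = 10 business cards; the names are placeholders.
frame = [f"Person {i}" for i in range(1, 11)]
n = 4

rng = random.Random(2016)  # fixed seed only so the sketch is repeatable

# With replacement: each draw is made from the full frame, so every card
# has probability 1/N on every draw and the same card can appear twice.
with_replacement = [rng.choice(frame) for _ in range(n)]

# Without replacement: a selected card is not available for later draws,
# so the n selected cards are always distinct.
without_replacement = rng.sample(frame, n)

print(with_replacement)
print(without_replacement)
```

The first list may contain a repeated name; the second never does.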

When creating a simple random sample, you should avoid the “fishbowl” method of selecting

a sample because this method lacks the ability to thoroughly mix the cards and, therefore,

randomly select a sample. You should use a more rigorous selection method.

One such method is to use a table of random numbers, such as Table E.1 in Appendix E,

for selecting the sample. A table of random numbers consists of a series of digits listed in

a randomly generated sequence. To use a random number table for selecting a sample, you

first need to assign code numbers to the individual items of the frame. Then you generate the

random sample by reading the table of random numbers and selecting those individuals from

the frame whose assigned code numbers match the digits found in the table. Because the number

system uses 10 digits (0, 1, 2, ..., 9), the chance that you will randomly generate any

particular digit is equal to the probability of generating any other digit. This probability is 1

out of 10. Hence, if you generate a sequence of 800 digits, you would expect about 80 to be the

digit 0, 80 to be the digit 1, and so on. Because every digit or sequence of digits in the table is

random, the table can be read either horizontally or vertically. The margins of the table designate

row numbers and column numbers. The digits themselves are grouped into sequences of

five in order to make reading the table easier.
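The table-reading procedure can be imitated in software. The sketch below is an assumption-laden stand-in for a printed table, not a reproduction of Table E.1: it uses Python's pseudorandom generator to produce the digits, and the function name select_by_random_digits, the seed, and the values N = 800 and n = 40 are chosen only for illustration.

```python
import random


def select_by_random_digits(N, n, rng):
    """Select n distinct code numbers from 1..N by reading random digits.

    Digits are read in groups as wide as the largest code number. A group is
    skipped when it falls outside 1..N or repeats an earlier selection
    (sampling without replacement).
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    width = len(str(N))
    selected = []
    while len(selected) < n:
        digits = "".join(str(rng.randrange(10)) for _ in range(width))
        code = int(digits)
        if 1 <= code <= N and code not in selected:
            selected.append(code)
    return selected


rng = random.Random(1)  # stands in for choosing a starting row and column
sample_codes = select_by_random_digits(N=800, n=40, rng=rng)
print(sample_codes)
```

Skipping out-of-range groups rather than adjusting them keeps every code number equally likely, which is the property the table method relies on.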

Learn More

Learn to use a table of

random numbers to select a

simple random sample in a

Chapter 1 online section.

Systematic Sample

In a systematic sample, you partition the N items in the frame into n groups of k items, where

k = N/n

You round k to the nearest integer. To select a systematic sample, you choose the first item to

be selected at random from the first k items in the frame. Then, you select the remaining n – 1

items by taking every kth item thereafter from the entire frame.

If the frame consists of a list of prenumbered checks, sales receipts, or invoices, taking a

systematic sample is faster and easier than taking a simple random sample. A systematic sample

is also a convenient mechanism for collecting data from membership directories, electoral

registers, class rosters, and consecutive items coming off an assembly line.

To take a systematic sample of n = 40 from the population of N = 800 full-time employees,

you partition the frame of 800 into 40 groups, each of which contains 20 employees. You

then select a random number from the first 20 individuals and include every twentieth individual

after the first selection in the sample. For example, if the first random number you select

is 008, your subsequent selections are 028, 048, 068, 088, 108, ..., 768, and 788.
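The employee example can be checked with a short Python sketch. The function name systematic_sample and its start parameter are assumptions introduced for the illustration; the arithmetic follows the k = N/n rule described in this section.

```python
import random


def systematic_sample(N, n, start=None, rng=random):
    """Return the code numbers chosen by a systematic sample.

    k = N / n is rounded to the nearest integer. The first item is chosen at
    random from the first k items unless start is given, and every kth item
    after it is selected. When N is not a multiple of n, rounding k can
    yield fewer than n items.
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    k = round(N / n)
    if start is None:
        start = rng.randint(1, k)
    if not 1 <= start <= k:
        raise ValueError("start must fall within the first k items")
    return list(range(start, N + 1, k))[:n]


selection = systematic_sample(N=800, n=40, start=8)
print(selection[:5], "...", selection[-2:])
# prints [8, 28, 48, 68, 88] ... [768, 788]
```

Only the starting point is random; every later selection is determined by it, which is why a pattern in the frame that repeats every k items can bias the sample.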

Simple random sampling and systematic sampling are simpler than other, more sophisticated,

probability sampling methods, but they generally require a larger sample size. In addition,

systematic sampling is prone to selection bias that can occur when there is a pattern in

the frame. To overcome the inefficiency of simple random sampling and the potential selection

bias involved with systematic sampling, you can use either stratified sampling methods or

cluster sampling methods.

Stratified Sample

In a stratified sample, you first subdivide the items in the frame into separate subpopulations,

or strata. A stratum is defined by some common characteristic, such as gender or year

in school. You select a simple random sample within each of the strata and combine the results

from the separate simple random samples. Stratified sampling is more efficient than either

simple random sampling or systematic sampling because you are ensured of the representation

of items across the entire population. The homogeneity of items within each stratum provides

greater precision in the estimates of underlying population parameters. In addition, stratified

sampling enables you to reach conclusions about each stratum in the frame. However, using a

stratified sample requires that you can determine the variable(s) on which to base the stratification

and can also be expensive to implement.
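A stratified selection can be sketched as follows. The text does not say how many items to take from each stratum, so the proportional allocation used here is an assumption, as are the function name stratified_sample and the hypothetical frame of 100 students.

```python
import random


def stratified_sample(strata, n, rng):
    """Take a simple random sample within each stratum and combine the results.

    strata maps a stratum name to the list of items in that stratum. Each
    stratum's share of n is proportional to its size (an assumed allocation
    rule); rounding means the total can differ slightly from n.
    """
    N = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        n_stratum = round(n * len(items) / N)
        sample[name] = rng.sample(items, n_stratum)
    return sample


# Hypothetical frame of 100 students stratified by year in school.
strata = {
    "freshman": [f"F{i}" for i in range(40)],
    "sophomore": [f"S{i}" for i in range(30)],
    "junior": [f"J{i}" for i in range(20)],
    "senior": [f"R{i}" for i in range(10)],
}
result = stratified_sample(strata, n=20, rng=random.Random(7))
print({name: len(items) for name, items in result.items()})
# prints {'freshman': 8, 'sophomore': 6, 'junior': 4, 'senior': 2}
```

Every stratum is guaranteed to appear in the sample, which a simple random sample of the same size cannot promise.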

Cluster Sample

In a cluster sample, you divide the items in the frame into clusters that contain several

items. Clusters are often naturally occurring groups, such as counties, election districts, city

blocks, households, or sales territories. You then take a random sample of one or more clusters

and study all items in each selected cluster.

Cluster sampling is often more cost-effective than simple random sampling, particularly

if the population is spread over a wide geographic region. However, cluster sampling often requires

a larger sample size to produce results as precise as those from simple random sampling

or stratified sampling. A detailed discussion of systematic sampling, stratified sampling, and

cluster sampling procedures can be found in references 2, 4, and 5.
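For comparison with the stratified approach, a cluster sample can be sketched in the same style. The function name cluster_sample and the frame of five hypothetical sales territories are assumptions made for the illustration.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and return every item in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    items = [item for name in chosen for item in clusters[name]]
    return chosen, items


# Hypothetical frame: 5 sales territories with 4 customers each.
clusters = {
    f"Territory {t}": [f"T{t}-C{c}" for c in range(1, 5)]
    for t in range(1, 6)
}
chosen, items = cluster_sample(clusters, n_clusters=2, rng=random.Random(3))
print(chosen)
print(items)
```

Randomness enters only in the choice of clusters; once a cluster is chosen, all of its items are studied.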

Learn More

Learn how to select a

stratified sample in a

Chapter 1 online section.

Problems for Section 1.3

Learning the Basics

1.17 For a population containing N = 902 individuals, what

code number would you assign for

a. the first person on the list?

b. the fortieth person on the list?

c. the last person on the list?

1.18 For a population of N = 902, verify that by starting in row 05,

column 01 of the table of random numbers (Table E.1), you need only

six rows to select a sample of n = 60 without replacement.

1.19 Given a population of N = 93, starting in row 29, column 01

of the table of random numbers (Table E.1), and reading across the

row, select a sample of n = 15

a. without replacement.

b. with replacement.

Applying the Concepts

1.20 For a study that consists of personal interviews with participants

(rather than mail or phone surveys), explain why simple random

sampling might be less practical than some other sampling methods.

1.21 You want to select a random sample of n = 1 from a population

of three items (which are called A, B, and C). The rule for

selecting the sample is as follows: Flip a coin; if it is heads, pick

item A; if it is tails, flip the coin again; this time, if it is heads,

choose B; if it is tails, choose C. Explain why this is a probability

sample but not a simple random sample.

1.22 A population has four members (called A, B, C, and D). You

would like to select a random sample of n = 2, which you decide

to do in the following way: Flip a coin; if it is heads, the sample will

be items A and B; if it is tails, the sample will be items C and D.

Although this is a random sample, it is not a simple random sample.

Explain why. (Compare the procedure described in Problem

1.21 with the procedure described in this problem.)

1.23 The registrar of a university with a population of N = 4,000

full-time students is asked by the president to conduct a survey

to measure satisfaction with the quality of life on campus. The

following table contains a breakdown of the 4,000 registered

full-time students, by gender and class designation:

The registrar intends to take a probability sample of n = 200 students

and project the results from the sample to the entire population

of full-time students.

a. If the frame available from the registrar’s files is an alphabetical

listing of the names of all N = 4,000 registered full-time

students, what type of sample could you take? Discuss.

b. What is the advantage of selecting a simple random sample

in (a)?

c. What is the advantage of selecting a systematic sample in (a)?

d. If the frame available from the registrar’s files is a list of the

names of all N = 4,000 registered full-time students compiled

from eight separate alphabetical lists, based on the gender and

class designation breakdowns shown in the class designation

table, what type of sample should you take? Discuss.

e. Suppose that each of the N = 4,000 registered full-time students

lived in one of the 10 campus dormitories. Each dormitory

accommodates 400 students. It is college policy to fully

integrate students by gender and class designation in each dormitory.

If the registrar is able to compile a listing of all students

by dormitory, explain how you could take a cluster sample.

SELF

Test

1.24 Prenumbered sales invoices are kept in a

sales journal. The invoices are numbered from 0001

to 5000.

a. Beginning in row 16, column 01, and proceeding horizontally

in a table of random numbers (Table E.1), select a simple random

sample of 50 invoice numbers.

b. Select a systematic sample of 50 invoice numbers. Use the random

numbers in row 20, columns 05–07, as the starting point

for your selection.

c. Are the invoices selected in (a) the same as those selected in

(b)? Why or why not?

1.25 Suppose that 10,000 customers in a retailer’s customer database

are categorized by three customer types: 3,500 prospective

buyers, 4,500 first time buyers, and 2,000 repeat (loyal) buyers.

A sample of 1,000 customers is needed.

a. What type of sampling should you do? Why?

b. Explain how you would carry out the sampling according to the

method stated in (a).

c. Why is the sampling in (a) not simple random sampling?

Table for Problem 1.23: registered full-time students by gender and class designation

              Class Designation
Gender    Fr.      So.    Jr.    Sr.    Total
Female    700      520    500    480    2,200
Male      560      460    400    380    1,800
Total     1,260    980    900    860    4,000

1.4 Types of Survey Errors

As you learned in Section 1.2, responses from a survey represent a source of data. Nearly

every day, you read or hear about survey or opinion poll results in newspapers, on the

Internet, or on radio or television. To identify surveys that lack objectivity or credibility,

you must critically evaluate what you read and hear by examining the validity of the survey

results. First, you must evaluate the purpose of the survey, why it was conducted, and for

whom it was conducted.

The second step in evaluating the validity of a survey is to determine whether it was based

on a probability or nonprobability sample (as discussed in Section 1.3). You need to remember

that the only way to make valid statistical inferences from a sample to a population is by using

a probability sample. Surveys that use nonprobability sampling methods are subject to serious

biases that may make the results meaningless.

Even when surveys use probability sampling methods, they are subject to four types of

potential survey errors:

• Coverage error

• Nonresponse error

• Sampling error

• Measurement error

Well-designed surveys reduce or minimize these four types of errors, often at considerable cost.

Coverage Error

The key to proper sample selection is having an adequate frame. Coverage error occurs if

certain groups of items are excluded from the frame so that they have no chance of being selected

in the sample or if items are included from outside the frame. Coverage error results in

selection bias. If the frame is inadequate because certain groups of items in the population

were not properly included, any probability sample selected will provide only an estimate of

the characteristics of the frame, not the actual population.

Nonresponse Error

Not everyone is willing to respond to a survey. Nonresponse error arises from failure to collect

data on all items in the sample and results in a nonresponse bias. Because you cannot always

assume that persons who do not respond to surveys are similar to those who do, you need

to follow up on the nonresponses after a specified period of time. You should make several

attempts to convince such individuals to complete the survey and possibly offer an incentive

to participate. The follow-up responses are then compared to the initial responses in order to

make valid inferences from the survey (see references 2, 4, and 5). The mode of response you

use, such as face-to-face interview, telephone interview, paper questionnaire, or computerized

questionnaire, affects the rate of response. Personal interviews and telephone interviews usually

produce a higher response rate than do mail surveys—but at a higher cost.

Sampling Error

When conducting a probability sample, chance dictates which individuals or items will or will

not be included in the sample. Sampling error reflects the variation, or “chance differences,”

from sample to sample, based on the probability of particular individuals or items being selected

in the particular samples.

When you read about the results of surveys or polls in newspapers or on the Internet, there

is often a statement regarding a margin of error, such as “the results of this poll are expected

to be within ±4 percentage points of the actual value.” This margin of error is the sampling

error.

You can reduce sampling error by using larger sample sizes. Of course, doing so increases

the cost of conducting the survey.
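The effect of sample size on sampling error can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not part of the text: each response is modeled as an independent yes/no answer with a true "yes" proportion of 0.5, and the function name, seed, sample sizes, and number of repetitions are chosen only for the example.

```python
import random
import statistics


def spread_of_sample_proportions(p, n, repetitions, rng):
    """Standard deviation of the sample proportion across repeated samples."""
    proportions = []
    for _ in range(repetitions):
        # Count simulated "yes" answers among n independent respondents.
        hits = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(hits / n)
    return statistics.stdev(proportions)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
spreads = {}
for n in (100, 400, 1600):
    spreads[n] = spread_of_sample_proportions(p=0.5, n=n, repetitions=500, rng=rng)
    print(n, round(spreads[n], 4))
```

The printed spread shrinks as n grows, which is the "chance difference" from sample to sample becoming smaller with larger samples.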

Measurement Error

In the practice of good survey research, you design surveys with the intention of gathering

meaningful and accurate information. Unfortunately, the survey results you get are often only a

proxy for the ones you really desire. Unlike height or weight, certain information about behaviors

and psychological states is impossible or impractical to obtain directly.

When surveys rely on self-reported information, the mode of data collection, the respondent

to the survey, and/or the survey itself can be possible sources of measurement error.

Satisficing, social desirability, reading ability, and/or interviewer effects can be dependent

on the mode of data collection. The social desirability bias or cognitive/memory limitations

of a respondent can affect the results. And vague questions, double-barreled questions

that ask about multiple issues but require a single response, or questions that ask the

respondent to report something that occurs over time but fail to clearly define the extent

of time about which the question asks (the reference period) are some of the survey flaws

that can cause errors.

To minimize measurement error, you need to standardize survey administration and respondent

understanding of questions, but there are many barriers to this (see references 1, 3,

and 10).

Ethical Issues About Surveys

Ethical considerations arise with respect to the four types of survey error. Coverage error

can result in selection bias and becomes an ethical issue if particular groups or individuals

are purposely excluded from the frame so that the survey results are more favorable to the

survey’s sponsor. Nonresponse error can lead to nonresponse bias and becomes an ethical

issue if the sponsor knowingly designs the survey so that particular groups or individuals

are less likely than others to respond. Sampling error becomes an ethical issue if the findings

are purposely presented without reference to sample size and margin of error so that

the sponsor can promote a viewpoint that might otherwise be inappropriate. Measurement
error can become an ethical issue in one of three ways: (1) a survey sponsor chooses leading
questions that guide the respondent in a particular direction; (2) an interviewer, through
mannerisms and tone, purposely makes a respondent obligated to please the interviewer
or otherwise guides the respondent in a particular direction; or (3) a respondent willfully
provides false information.

Ethical issues also arise when the results of nonprobability samples are used to form conclusions
about the entire population. When you use a nonprobability sampling method, you
need to explain the sampling procedures and state that the results cannot be generalized beyond the sample.

Think About This: New Media Surveys/Old Sampling Problems

A software company executive decided to create a “customer experience improvement program” to
record how customers use its products, with the goal of using the collected data to make product
enhancements. An editor of a news website decides to create an instant poll to ask website visitors
about important political issues. A marketer of products aimed at a specific demographic decides
to use a social networking site to collect consumer feedback. What do these decisions have in common
with a dead-tree publication that went out of business over 70 years ago?

By 1932, long before the Internet, “straw polls” conducted by the magazine Literary Digest
had successfully predicted five U.S. presidential elections in a row. For the 1936 election, the
magazine promised its largest poll ever and sent about 10 million ballots to people all across the
country. After receiving and tabulating more than 2.3 million ballots, the Digest confidently proclaimed
that Alf Landon would be an easy winner over Franklin D. Roosevelt. As things turned
out, FDR won in a landslide, with Landon receiving the fewest electoral votes in U.S. history.
The reputation of Literary Digest was ruined; the magazine would cease publication less than two
years later.

The failure of the Literary Digest poll was a watershed event in the history of sample surveys
and polls. This failure refuted the notion that the larger the sample is, the better. (Remember this
the next time someone complains about a political survey’s “small” sample size.) The failure opened
the door to new and more modern methods of sampling discussed in this chapter. Using the predecessors
of those methods, George Gallup, the “Gallup” of the famous poll, and Elmo Roper, of the
eponymous reports, both first gained widespread public notice for their correct “scientific” predictions
of the 1936 election.

The failed Literary Digest poll became fodder for several postmortems, and the reason
for the failure became almost an urban legend. Typically, the explanation is coverage error: The
ballots were sent mostly to “rich people,” and this created a frame that excluded poorer citizens
(presumably more inclined to vote for the Democrat Roosevelt than the Republican Landon).
However, later analyses suggest that this was not true; instead, low rates of response (2.3 million
ballots represented less than 25% of the ballots distributed) and/or nonresponse error (Roosevelt
voters were less likely to mail in a ballot than Landon voters) were significant reasons for the
failure (see reference 9).

When Microsoft first revealed its Office Ribbon interface, a manager explained how Microsoft
had applied data collected from its “Customer Experience Improvement Program” to the user interface
redesign. This led others to speculate that the data were biased toward beginners (who
might be less likely to decline participation in the program) and that, in turn, had led Microsoft to
create a user interface that ended up perplexing more experienced users. This was another case of
nonresponse error!

The editor’s instant poll mentioned earlier is targeted to the visitors of the news website,
and the social network–based survey is aimed at “friends” of a product; such polls can also
suffer from nonresponse errors. Often, marketers extol how much they “know” about survey
respondents, thanks to data that can be collected from a social network community. But no amount
of information about the respondents can tell marketers who the nonrespondents are. Therefore,
new media surveys fall prey to the same old type of error that proved fatal to Literary Digest
way back when.

Today, companies establish formal surveys based on probability sampling and go to great
lengths, and spend large sums, to deal with coverage error, nonresponse error, sampling error,
and measurement error. Instant polling and tell-a-friend surveys can be interesting and fun, but they
are not replacements for the methods discussed in this chapter.

Problems for Section 1.4

Applying the Concepts

1.26 A survey indicates that the vast majority of college students

own their own personal computers. What information would you

want to know before you accepted the results of this survey?

1.27 A simple random sample of n = 300 full-time employees

is selected from a company list containing the names of

all N = 5,000 full-time employees in order to evaluate job

satisfaction.

a. Give an example of possible coverage error.

b. Give an example of possible nonresponse error.

c. Give an example of possible sampling error.

d. Give an example of possible measurement error.

SELF

Test

1.28 The results of a 2013 Adobe Systems study on

retail apps and buying habits reveal insights on perceptions

and attitudes toward mobile shopping using retail apps and

browsers, providing new direction for retailers to develop their

digital publishing strategies (adobe.ly/11gt8Rq). Increased consumer

interest in using shopping applications means retailers

must adapt to meet the rising expectations for specialized mobile

shopping experiences. The results indicate that tablet users (55%)

are almost twice as likely as smartphone users (28%) to use their

device to purchase products and services. The findings also reveal

that retail and catalog apps are rapidly catching up to mobile

browsers as a viable shopping channel: nearly half of all mobile

shoppers are interested in using apps instead of a mobile browser

(45% of tablet shoppers and 49% of smartphone shoppers). The

research is based on an online survey with a sample of 1,003 consumers.

Identify potential concerns with coverage, nonresponse,

sampling, and measurement errors.

1.29 A recent PwC Global Supply Chain survey indicated that

companies that acknowledge the supply chain as a strategic

asset achieve 70% higher performance (pwc.to/VaFpGz). The

“Leaders” in the survey point to next-generation supply chains,

which are fast, flexible, and responsive. They are more concerned

with skills that separate a company from the crowd: 51% say differentiating

capabilities is the real key to success. What additional

information would you want to know about the survey before you

accepted the results of the study?

1.30 A recent survey points to a next generation of consumers

seeking a more mobile TV experience. The 2013 KPMG

International Consumer Media Behavior study found that while

TV is still the most popular media activity with 88% of U.S.

consumers watching TV, a relatively high proportion of U.S. consumers,

14%, now prefer to watch TV via their mobile device or

tablet for greater flexibility (bit.ly/Wb8Jv9). What additional

information would you want to know about the survey before you

accepted the results of the study?

Using Statistics

Beginning of the End… Revisited

The analysts charged by GT&M CEO Emma Levia to identify, define, and collect the data that would be helpful in setting a price for Whitney Wireless have completed their task. The group has identified a number of variables to analyze. In the course of doing this work, the group realized that most of the variables to study would be discrete numerical variables based on data that (ac)counts the financials of the business. These data would mostly be from the primary source of the business itself, but some supplemental variables about economic conditions and other factors that might affect the long-term prospects of the business might come from a secondary data source, such as an economic agency.


Business Statistics: A First Course, Seventh Edition, by David M. Levine, Kathryn A. Szabat, and David F. Stephan. Published by Pearson.

Copyright © 2016 by Pearson Education, Inc.

ISBN: 978-1-323-26258-0

Summary

In this chapter, you learned about the various types of

variables used in business. In addition, you learned about

different methods of collecting data, several statistical

sampling methods, and issues involved

in taking samples.

In the next two chapters, you will study a variety of tables

and charts and descriptive measures that are used to present

and analyze data.

References

1. Biemer, P. B., R. M. Groves, L. E. Lyberg, A. Mathiowetz, and S. Sudman. Measurement Errors in Surveys. New York: Wiley Interscience, 2004.

2. Cochran, W. G. Sampling Techniques, 3rd ed. New York: Wiley, 1977.

3. Fowler, F. J. Improving Survey Questions: Design and Evaluation, Applied Special Research Methods Series, Vol. 38, Thousand Oaks, CA: Sage Publications, 1995.

4. Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. Survey Methodology, 2nd ed. New York: John Wiley, 2009.

5. Lohr, S. L. Sampling Design and Analysis, 2nd ed. Boston, MA: Brooks/Cole Cengage Learning, 2010.

6. Microsoft Excel 2013. Redmond, WA: Microsoft Corporation, 2012.

7. Minitab Release 16. State College, PA: Minitab, Inc., 2010.

8. Osborne, J. Best Practices in Data Cleaning. Thousand Oaks, CA: Sage Publications, 2012.

9. Squire, P. “Why the 1936 Literary Digest Poll Failed.” Public Opinion Quarterly 52 (1988): 125–133.

10. Sudman, S., N. M. Bradburn, and N. Schwarz. Thinking About Answers: The Application of Cognitive Processes to Survey Methodology. San Francisco, CA: Jossey-Bass, 1993.

Key Terms

categorical variable 11

cluster 18

cluster sample 18

collect 11

collectively exhaustive 15

continuous variable 12

convenience sample 16

coverage error 20

define 11

discrete variable 12

frame 16

judgment sample 17

margin of error 20

measurement error 20

missing value 15

mutually exclusive 15

nonprobability sample 16

nonresponse bias 20

nonresponse error 20

numerical variable 11

operational definition 11

outlier 15

parameter 14

population 14

primary data source 13

probability sample 16

qualitative variable 11

quantitative variable 11

recoded variable 15

sample 14

sampling error 20

sampling with replacement 17

sampling without replacement 17

secondary data source 13

selection bias 20

simple random sample 17

statistics 14

strata 18

stratified sample 18

systematic sample 18

table of random numbers 17

unstructured data 14

The group foresaw that examining several categorical variables

related to the customers of both GT&M and Whitney

Wireless would be necessary. The group discovered that the affinity

(“shopper’s card”) programs of both firms had already

collected demographic data of interest when customers enrolled

in those programs. That primary source, when combined

with secondary data gleaned from the social media networks

to which the business belongs, might prove useful in getting a

rough approximation of the profile of a typical customer that

might be interested in doing business with an “A-to-Z” electronics retailer.






Checking Your Understanding

1.31 What is the difference between a sample and a population?

1.32 What is the difference between a statistic and a parameter?

1.33 What is the difference between a categorical variable and a

numerical variable?

1.34 What is the difference between a discrete numerical variable

and a continuous numerical variable?

1.35 What is the difference between probability sampling and nonprobability

sampling?

Chapter Review Problems

1.36 Visit the official website for either Excel (www.office.microsoft.com/excel)

or Minitab (www.minitab.com/products/minitab). Read about the program you chose and then think about the

ways the program could be useful in statistical analysis.

1.37 Results of a 2013 Adobe Systems study on retail apps and

buying habits reveal insights on perceptions and attitudes toward

mobile shopping using retail apps and browsers, providing new direction

for retailers to develop their digital publishing strategies.

Increased consumer interest in using shopping applications means

retailers must adapt to meet the rising expectations for specialized

mobile shopping experiences. The results indicate that tablet users

(55%) are almost twice as likely as smartphone users (28%) to

use their device to purchase products and services. The findings

also reveal that retail and catalog apps are rapidly catching up to

mobile browsers as a viable shopping channel: Nearly half of all

mobile shoppers are interested in using apps instead of a mobile

browser (45% of tablet shoppers and 49% of smartphone shoppers).

The research is based on an online survey with a sample

of 1,003 people ages 18–54 who currently own a smartphone and/or

tablet; it includes consumers who use and do not use these devices

to shop (adobe.ly/11gt8Rq).

a. Describe the population of interest.

b. Describe the sample that was collected.

c. Describe a parameter of interest.

d. Describe the statistic used to estimate the parameter in (c).

1.38 The Gallup organization releases the results of recent polls

at its website, www.gallup.com. Visit this site and read an article

of interest.

a. Describe the population of interest.

b. Describe the sample that was collected.

c. Describe a parameter of interest.

d. Describe the statistic used to estimate the parameter in (c).

1.39 A recent PwC Global Supply Chain survey indicated that companies

that acknowledge the supply chain as a strategic asset achieve

70% higher performance. The “Leaders” in the survey point to next-generation

supply chains, which are fast, flexible, and responsive. They

are more concerned with skills that separate a company from the crowd:

51% say differentiating capabilities is the real key to success (pwc.to/VaFpGz).

The results are based on a survey of 503 supply chain executives

in a wide range of industries representing a mix of company

sizes from across three global regions: Asia, Europe, and the

Americas.

a. Describe the population of interest.

b. Describe the sample that was collected.

c. Describe a parameter of interest.

d. Describe the statistic used to estimate the parameter in (c).

1.40 The Data and Story Library (DASL) is an online library of

data files and stories that illustrate the use of basic statistical methods.

Visit lib.stat.cmu.edu/index.php, click DASL, and explore a

data set of interest to you.

a. Describe a variable in the data set you selected.

b. Is the variable categorical or numerical?

c. If the variable is numerical, is it discrete or continuous?

1.41 Download and examine the U.S. Census Bureau’s “Business and

Professional Classification Survey (SQ-CLASS),” available through

the Get Help with Your Form link at www.census.gov/econ/.

a. Give an example of a categorical variable included in the survey.

b. Give an example of a numerical variable included in the survey.

1.42 Three professors examined awareness of four widely disseminated

retirement rules among employees at the University of Utah.

These rules provide simple answers to questions about retirement planning

(R. N. Mayer, C. D. Zick, and M. Glaittle, “Public Awareness of

Retirement Planning Rules of Thumb,” Journal of Personal Finance,

2011 10(1), 12–35). At the time of the investigation, there were approximately

10,000 benefited employees, and 3,095 participated in the

study. Demographic data collected on these 3,095 employees included

gender, age (years), education level (years completed), marital status,

household income ($), and employment category.

a. Describe the population of interest.

b. Describe the sample that was collected.

c. Indicate whether each of the demographic variables mentioned

is categorical or numerical.

1.43 A manufacturer of cat food is planning to survey households in the United States to determine purchasing habits of cat owners.

Among the variables to be collected are the following:

i. The primary place of purchase for cat food

ii. Whether dry or moist cat food is purchased

iii. The number of cats living in the household

iv. Whether any cat living in the household is pedigreed

a. For each of the four items listed, indicate whether the variable

is categorical or numerical. If it is numerical, is it discrete or

continuous?

b. Develop five categorical questions for the survey.

c. Develop five numerical questions for the survey.

Cases for Chapter 1

Managing Ashland MultiComm Services

Ashland MultiComm Services (AMS) provides high-quality

communications networks in the Greater Ashland

area. AMS traces its roots to Ashland Community Access

Television (ACATV), a small company that redistributed the

broadcast television signals from nearby major metropolitan

areas but has evolved into a provider of a wide range of

broadband services for residential customers.

AMS offers subscription-based services for digital cable

video programming, local and long-distance telephone

services, and high-speed Internet access. Recently, AMS has

faced competition from other network providers that have

expanded into the Ashland area. AMS has also seen decreases

in the number of new digital cable installations and

the rate of digital cable renewals.

AMS management believes that a combination of increased

promotional expenditures, adjustment in subscription

fees, and improved customer service will allow AMS

to successfully face the competition from other network

providers. However, AMS management worries about the

possible effects that new Internet-based methods of program

delivery may have had on their digital cable business. They

decide that they need to conduct some research and organize

a team of research specialists to examine the current status

of the business and the marketplace in which it competes.

The managers suggest that the research team examine

the company’s own historical data for number of subscribers,

revenues, and subscription renewal rates for the past

few years. They direct the team to examine year-to-date data

as well, as the managers suspect that some of the changes

they have seen have been a relatively recent phenomenon.

1. What type of data source would the company’s own

historical

data be? Identify other possible data sources

that the research team might use to examine the current

marketplace for residential broadband services in a city

such as Ashland.

2. What type of data collection techniques might the team

employ?

3. In their suggestions and directions, the AMS managers

have named a number of possible variables to study, but

offered no operational definitions for those variables.

What types of possible misunderstandings could arise if

the team and managers do not first properly define each

variable cited?

CardioGood Fitness

CardioGood Fitness is a developer of high-quality cardiovascular

exercise equipment. Its products include treadmills,

fitness bikes, elliptical machines, and e-glides. CardioGood

Fitness looks to increase the sales of its treadmill products

and has hired The AdRight Agency, a small advertising

firm, to create and implement an advertising program. The

AdRight Agency plans to identify particular market segments

that are most likely to buy their clients’ goods and

services and then to locate advertising outlets that will reach

that market group. This activity includes collecting data on

clients’ actual sales and on the customers who make the

purchases, with the goal of determining whether there is a

distinct profile of the typical customer for a particular product

or service. If a distinct profile emerges, efforts are made

to match that profile to advertising outlets known to reflect

the particular profile, thus targeting advertising directly to

high-potential customers.

CardioGood Fitness sells three different lines of treadmills.

The TM195 is an entry-level treadmill. It is as dependable

as other models offered by CardioGood Fitness,

but with fewer programs and features. It is suitable for individuals

who thrive on minimal programming and the desire

for simplicity to initiate their walk or hike. The TM195 sells

for $1,500.

The middle-line TM498 adds to the features of the

entry-level model two user programs and up to 15% elevation

upgrade. The TM498 is suitable for individuals who are

walkers at a transitional stage from walking to running or

midlevel runners. The TM498 sells for $1,750.

The top-of-the-line TM798 is structurally larger and

heavier and has more features than the other models. Its

unique features include a bright blue backlit LCD console,

quick speed and incline keys, a wireless heart rate monitor

with a telemetric chest strap, remote speed and incline controls,

and an anatomical figure that specifies which muscles

are minimally and maximally activated. This model features

a nonfolding platform base that is designed to handle rigorous,

frequent running; the TM798 is therefore appealing

to someone who is a power walker or a runner. The selling

price is $2,500.

As a first step, the market research team at AdRight is

assigned the task of identifying the profile of the typical

customer for each treadmill product offered by CardioGood

Fitness. The market research team decides to investigate

whether there are differences across the product lines with respect to customer characteristics. The team decides to collect data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months.

The team decides to use both business transactional data and the results of a personal profile survey that every purchaser completes as their sources of data. The team identifies the following customer variables to study: product purchased (TM195, TM498, or TM798); gender; age, in years; education, in years; relationship status, single or partnered; annual household income ($); mean number of times the customer plans to use the treadmill each week; mean number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape. For this set of variables:

1. Which variables in the survey are categorical?

2. Which variables in the survey are numerical?

3. Which variables are discrete numerical variables?

Clear Mountain State Student Surveys

1. The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the undergraduate students who attend CMSU. They create and distribute a survey of 14 questions and receive responses from 62 undergraduates (stored in UndergradSurvey). Download (see Appendix C) and review the survey document CMUndergradSurvey.pdf. For each question asked in the survey, determine whether the variable is categorical or numerical. If you determine that the variable is numerical, identify whether it is discrete or continuous.

2. The dean of students at CMSU has learned about the undergraduate survey and has decided to undertake a similar survey for graduate students at CMSU. She creates and distributes a survey of 14 questions and receives responses from 44 graduate students (stored in GradSurvey). Download (see Appendix C) and review the survey document CMGradSurvey.pdf. For each question asked in the survey, determine whether the variable is categorical or numerical. If you determine that the variable is numerical, identify whether it is discrete or continuous.

Learning with the Digital Cases

As you have already learned in this book, decision makers

use statistical methods to help analyze data and communicate

results. Every day, somewhere, someone misuses these

techniques either by accident or intentional choice. Identifying

and preventing such misuses of statistics is an important

responsibility for all managers. The Digital Cases give you

the practice you need to help develop the skills necessary

for this important task.

Each chapter’s Digital Case tests your understanding of

how to apply an important statistical concept taught in the

chapter. As in many business situations, not all of the information

you encounter will be relevant to your task, and you

may occasionally discover conflicting information that you

have to resolve in order to complete the case.

To assist your learning, each Digital Case begins with

a learning objective and a summary of the problem or issue

at hand. Each case directs you to the information necessary

to reach your own conclusions and to answer the case

questions. Many cases, such as the sample case worked out

next, extend a chapter’s Using Statistics scenario. You can

download digital case files for later use or retrieve them online

from a MyStatLab course for this book, as explained in

Appendix C.

To illustrate learning with a Digital Case, open the

Digital Case file WhitneyWireless.pdf that contains summary

information about the Whitney Wireless business.

Recall from the Using Statistics scenario for this chapter

that Good Tunes & More (GT&M) is a retailer seeking to

expand by purchasing Whitney Wireless, a small chain that

sells mobile media devices. Apparently, from the claim on

the title page, this business is celebrating its “best sales

year ever.”

Review the Who We Are, What We Do, and What We

Plan to Do sections on the second page. Do these sections

contain any useful information? What questions does this

passage raise? Did you notice that while many facts are presented,

no data that would support the claim of “best sales

year ever” are presented? And were those mobile “mobilemobiles”

used solely for promotion? Or did they generate

any sales? Do you think that a talk-with-your-mouth-full

event, however novel, would be a success?

Continue to the third page and the Our Best Sales Year

Ever! section. How would you support such a claim? With

a table of numbers? Remarks attributed to a knowledgeable

source? Whitney Wireless has used a chart to present

“two years ago” and “latest twelve months” sales data by

category.

Are there any problems with what the company

has done? Absolutely!

First, note that there are no scales for the symbols

used, so you cannot know what the actual sales volumes

are. In fact, as you will learn in Section 2.7, charts that incorporate

icons as shown on the third page are considered

examples of chartjunk and would never be used by people

seeking to properly visualize data. The use of chartjunk

symbols creates the impression that unit sales data are being

presented. If the data are unit sales, does such data best

support the claim being made, or would something else,

such as dollar volumes, be a better indicator of sales at the

retailer?

For the moment, let’s assume that unit sales are being

visualized. What are you to make of the second row,

in which the three icons on the right side are much wider

than the three on the left? Does that row represent a newer

(wider) model being sold or a greater sales volume? Examine

the fourth row. Does that row represent a decline in sales

or an increase? (Do two partial icons represent more than

one whole icon?) As for the fifth row, what are we to think?

Is a black icon worth more than a red icon or vice versa?

At least the third row seems to tell some sort of tale of

increased sales, and the sixth row tells a tale of constant

sales. But what is the “story” about the seventh row? There,

the partial icon is so small that we have no idea what product

category the icon represents.

Perhaps a more serious issue is those curious chart labels.

“Latest twelve months” is ambiguous; it could include

months from the current year as well as months from one

year ago and therefore may not be an equivalent time period

to “two years ago.” But the business was established in 2001,

and the claim being made is “best sales year ever,” so why

hasn’t management included sales figures for every year?

Are the Whitney Wireless managers hiding something,

or are they just unaware of the proper use of statistics? Either

way, they have failed to properly organize and visualize

their data and therefore have failed to communicate a vital

aspect of their story.

In subsequent Digital Cases, you will be asked to provide

this type of analysis, using the open-ended case questions

as your guide. Not all the cases are as straightforward

as this example, and some cases include perfectly appropriate

applications of statistical methods.

EG1.1 Defining Variables

Classifying Variables by Type

Microsoft Excel infers the variable type from the data you enter

into a column. If Excel discovers a column that contains numbers,

it treats the column as a numerical variable. If Excel discovers a

column that contains words or alphanumeric entries, it treats the

column as a non-numerical (categorical) variable.

This imperfect method works most of the time, especially if

you make sure that the categories for your categorical variables are

words or phrases such as “yes” and “no.” However, because you

cannot explicitly define the variable type, Excel can mistakenly

offer or allow you to do nonsensical things such as using a statistical

method that is designed for numerical variables on categorical

variables. If you must use coded values such as 1, 2, or 3, enter

them preceded with an apostrophe, as Excel treats all values that

begin with an apostrophe as non-numerical data. (You can check

whether a cell entry includes a leading apostrophe by selecting a

cell and viewing the contents of the cell in the formula bar.)
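The inference rule described above can be made concrete with a short Python sketch. This is only an illustration of the idea, not a description of how Excel is implemented: the function name `infer_variable_type` and the sample columns are hypothetical, and entries are assumed to arrive as text exactly as typed.

```python
def infer_variable_type(column):
    """Classify a column as 'numerical' or 'categorical' from its entries.

    Assumption: each entry is the text a user typed into a cell.
    """
    def looks_numeric(entry):
        # A leading apostrophe marks the entry as text, mirroring the
        # convention described in this section.
        if entry.startswith("'"):
            return False
        try:
            float(entry)
            return True
        except ValueError:
            return False

    # The column counts as numerical only if every entry parses as a number.
    return "numerical" if all(looks_numeric(e) for e in column) else "categorical"


print(infer_variable_type(["3.2", "2.9", "3.8"]))   # GPA-like values
print(infer_variable_type(["yes", "no", "yes"]))    # words
print(infer_variable_type(["1", "2", "3"]))         # coded categories typed as numbers
print(infer_variable_type(["'1", "'2", "'3"]))      # coded categories with apostrophes
```

The third call shows the pitfall: coded categories entered as bare digits are indistinguishable from a numerical variable, which is why words or a leading apostrophe are the safer choice.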

EG1.2 Collecting Data

Recoding Variables

Key Technique To recode a categorical variable, you first copy

the original variable’s column of data and then use the find-and-replace

function on the copied data. To recode a numerical variable,

enter a formula that returns a recoded value in a new column.

Example Using the DATA worksheet of the Recoded workbook,

create the recoded variable UpperLower from the categorical

variable Class and create the recoded variable Dean’s List

from the numerical variable GPA.

In-Depth Excel Use the RECODED worksheet of the

Recoded

workbook as a model.

The worksheet already contains UpperLower, a recoded version

of Class that uses the operational definitions on page 15, and

Dean’s List, a recoded version of GPA, in which the value No recodes

all GPA values less than 3.3 and Yes recodes all values 3.3

or greater than 3.3. The RECODED_FORMULAS worksheet in

the same workbook shows how formulas in column I use the IF

function to recode GPA as the Dean’s List variable.

These recoded variables were created by first opening to the

DATA worksheet in the same workbook and then following these

steps:

1. Right-click column D (right-click over the shaded “D” at the

top of column D) and click Copy in the shortcut menu.

2. Right-click column H and click the first choice in the Paste

Options gallery.

3. Enter UpperLower in cell H1.

4. Select column H. With column H selected, click Home ➔

Find & Select ➔ Replace.

In the Replace tab of the Find and Replace dialog box:

5. Enter Senior as Find what, Upper as Replace with, and

then click Replace All.

6. Click OK to close the dialog box that reports the results of

the replacement command.

7. Still in the Find and Replace dialog box, enter Junior as

Find what (replacing Senior), and then click Replace All.

8. Click OK to close the dialog box that reports the results of

the replacement command.

9. Still in the Find and Replace dialog box, enter Sophomore

as Find what, Lower as Replace with, and then click

Replace All.

10. Click OK to close the dialog box that reports the results of

the replacement command.

11. Still in the Find and Replace dialog box, enter Freshman as

Find what and then click Replace All.

12. Click OK to close the dialog box that reports the results of

the replacement command.

(This creates the recoded variable UpperLower in column H.)

13. Enter Dean’s List in cell I1.

14. Enter the formula =IF(G2 < 3.3, "No", "Yes") in cell I2.

15. Copy this formula down the column to the last row that contains

student data (row 63).

(This creates the recoded variable Dean’s List in column I.)

The RECODED worksheet uses the IF function (see

Appendix F) to recode the numerical variable into two categories.

Numerical variables can also be recoded into multiple categories

by using the VLOOKUP function. Read the Short Takes for Chapter

1 to learn more about this advanced recoding technique.
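For readers who also work outside Excel, the two recodings in this section can be sketched in Python. The mapping and the 3.3 cutoff follow the steps above; the function names and the three sample rows are made up for illustration and are not taken from the Recoded workbook.

```python
# Categorical recode: the same substitutions made with Find and Replace.
UPPER_LOWER = {
    "Senior": "Upper",
    "Junior": "Upper",
    "Sophomore": "Lower",
    "Freshman": "Lower",
}


def recode_class(class_value):
    """Return the UpperLower category for a Class value."""
    return UPPER_LOWER[class_value]


def recode_gpa(gpa, cutoff=3.3):
    """Mirror =IF(G2 < 3.3, "No", "Yes"): No below the cutoff, Yes otherwise."""
    return "No" if gpa < cutoff else "Yes"


# Hypothetical (Class, GPA) rows, not real student data.
students = [("Senior", 3.5), ("Freshman", 3.3), ("Junior", 2.9)]
recoded = [(recode_class(c), recode_gpa(g)) for c, g in students]
print(recoded)  # [('Upper', 'Yes'), ('Lower', 'Yes'), ('Upper', 'No')]
```

Note that a GPA of exactly 3.3 is recoded as Yes, matching the operational definition used in the worksheet.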

EG1.3 Types of Sampling Methods

Simple Random Sample

Key Technique Use the RANDBETWEEN(smallest integer,

largest integer) function to generate a random integer that can

then be used to select an item from a frame.

Example 1 Create a simple random sample with replacement of

size 40 from a population of 800 items.

In-Depth Excel Enter a formula that uses this function and

then copy the formula down a column for as many rows as is necessary.

For example, to create a simple random sample with replacement

of size 40 from a population of 800 items, open to a

new worksheet.

Enter Sample in cell A1 and enter the formula

=RANDBETWEEN(1, 800) in cell A2. Then copy the formula

down the column to cell A41.
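As a point of comparison, the same idea can be sketched in Python, where `random.randint(a, b)` plays a role similar to RANDBETWEEN by returning an integer between `a` and `b` inclusive. The function name `sample_with_replacement` and the seed value are assumptions made for this illustration.

```python
import random


def sample_with_replacement(population_size, sample_size, seed=None):
    """Draw item numbers 1..population_size, allowing repeats."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return [rng.randint(1, population_size) for _ in range(sample_size)]


sample = sample_with_replacement(800, 40, seed=1)
# The second number can be less than 40 because repeats are allowed.
print(len(sample), len(set(sample)))
```

Each value in `sample` is the position of an item in the frame, so the sample itself is obtained by looking up those positions in the list of 800 items.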

Excel contains no functions to select a random sample without

replacement. Such samples are most easily created using an

add-in such as PHStat, as described in Example 2. The Analysis ToolPak Sampling procedure, described next, samples with replacement.

Analysis ToolPak Use Sampling to create a random sample

with replacement.

For the example, open to the worksheet that contains the population

of 800 items in column A and that contains a column heading

in cell A1. Select Data ➔ Data Analysis. In the Data Analysis

dialog box, select Sampling from the Analysis Tools list and then

click OK. In the procedure’s dialog box (shown below):

1. Enter A1:A801 as the Input Range and check Labels.

2. Click Random and enter 40 as the Number of Samples.

3. Click New Worksheet Ply and then click OK.

Example 2 Create a simple random sample without replacement

of size 40 from a population of 800 items.

PHStat Use Random Sample Generation.

For the example, select PHStat ➔ Sampling ➔ Random Sample

Generation. In the procedure’s dialog box (shown in next column):

1. Enter 40 as the Sample Size.

2. Click Generate list of random numbers and enter 800 as

the Population Size.

3. Enter a Title and click OK.

Unlike most other PHStat results worksheets, the worksheet created

contains no formulas.

In-Depth Excel Use the COMPUTE worksheet of the Random

workbook as a template.

The worksheet already contains 40 copies of the formula

=RANDBETWEEN(1, 800) in column B. Because the

RANDBETWEEN function samples with replacement as discussed

at the start of this section, you may need to add additional copies of

the formula in new column B rows until you have 40 unique values.

If your intended sample size is large, you may find it difficult

to spot duplicates. Read the Short Takes for Chapter 1 to learn

more about an advanced technique that uses formulas to detect duplicate

values.
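As a rough illustration of both ideas (spotting duplicates, and sampling without replacement directly), consider the following Python sketch. It is not the Short Takes technique and not part of the Excel Guide; the function names are hypothetical.

```python
import random
from collections import Counter


def find_duplicates(values):
    """Return the sorted values that occur more than once."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


def sample_without_replacement(population_size, sample_size, seed=None):
    """Draw distinct item numbers from 1..population_size."""
    rng = random.Random(seed)  # seed is optional; it only makes runs repeatable
    return rng.sample(range(1, population_size + 1), sample_size)


# A short made-up list with repeats, as sampling with replacement can produce.
print(find_duplicates([12, 7, 12, 300, 7, 45]))

# Sampling without replacement never yields duplicates.
unique_sample = sample_without_replacement(800, 40, seed=1)
print(find_duplicates(unique_sample))
```

The first call reports 7 and 12 as duplicated; the second reports none, because random.sample never selects the same item twice.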

MG1.1 Defining Variables

Classifying Variables by Type

When Minitab adds a “-T” suffix to a column name, it is classifying

the column as a categorical, or text, variable. When Minitab

does not add a suffix, it is classifying the column as a numerical

variable. (A column name with the “-D” suffix is a date variable, a

special type of a numerical variable.)

Sometimes, Minitab will misclassify a variable, for example,

mistaking a numerical variable for a categorical (text) variable. In

such cases, select the column, then select Data ➔ Change Data

Type, and then select the appropriate choice, such as Text to

Numeric when Minitab has mistaken a numerical variable for a

categorical variable.

MG1.2 Collecting Data

Recoding Variables

Use the Replace command to recode a categorical variable and

Calculator to recode a numerical variable.

For example, to create the recoded variable UpperLower from

the categorical variable Class (C4-T), open to the DATA worksheet

of the Recode project and:

1. Select the Class column (C4-T).

2. Select Editor ➔ Replace.

In the Replace in Data Window dialog box:

3. Enter Senior as Find what and Upper as Replace with, and

then click Replace All.

4. Click OK to close the dialog box that reports the results of

the replacement command.

5. Still in the Replace in Data Window dialog box, enter Junior as

Find what (replacing Senior), and then click Replace All.

6. Click OK to close the dialog box that reports the results of

the replacement command.

7. Still in the Replace in Data Window dialog box, enter Sophomore as

Find what and Lower as Replace with, and then click Replace All.

8. Click OK to close the dialog box that reports the results of

the replacement command.

9. Still in the Replace in Data Window dialog box, enter Freshman as

Find what, and then click Replace All.

10. Click OK to close the dialog box that reports the results of

the replacement command.

To create the recoded variable Dean’s List from the numerical

variable GPA (C7), with the DATA worksheet of the Recode project

still open:

1. Enter Dean’s List as the name of the empty column C8.

2. Select Calc ➔ Calculator.

In the Calculator dialog box (shown below):

3. Enter C8 in the Store result in variable box.

4. Enter IF(GPA < 3.3, "No", "Yes") in the Expression box.

5. Click OK.

Variables can also be recoded into multiple categories by using the

Data ➔ Code command. Read the Short Takes for Chapter 1 to

learn more about this advanced recoding technique.
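To make the idea of recoding into multiple categories concrete, here is a minimal Python sketch. It does not reproduce the Data ➔ Code command or the Short Takes technique; the cutoffs, labels, and function name are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical lower boundaries and labels (assumptions, not from the text).
# A GPA falls into the category whose range contains it.
CUTOFFS = [2.0, 3.0, 3.3]
LABELS = [
    "Below 2.0",
    "2.0 to under 3.0",
    "3.0 to under 3.3",
    "3.3 or higher",
]


def recode_gpa(gpa):
    """Recode a numerical GPA into one of four ordered categories."""
    # bisect_right counts how many cutoffs are <= gpa, which is the
    # index of the matching label.
    return LABELS[bisect_right(CUTOFFS, gpa)]


print([recode_gpa(g) for g in (1.8, 2.0, 3.1, 3.3, 4.0)])
```

Because each category includes its lower boundary and excludes its upper one, every GPA lands in exactly one category, which keeps the recoded categories mutually exclusive and collectively exhaustive.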

MG1.3 Types of Sampling Methods

Simple Random Samples

Use Sample From Columns.

For example, to create a simple random sample with replacement

of size 40 from a population of 800 items, first create the list

of 800 employee numbers in column C1.

Select Calc ➔ Make Patterned Data ➔ Simple Set of Numbers.

In the Simple Set of Numbers dialog box (shown below):

1. Enter C1 in the Store patterned data in box.

2. Enter 1 in the From first value box.

3. Enter 800 in the To last value box.