
Title: UCSF - Data Exploration


Slide 1: Multiple Predictor Regression

  • Assess the relationship between an outcome and multiple predictors
  • Powerful tool for:
    • understanding complex relationships
    • controlling confounding
    • prediction / risk stratification
  • Regression models differ by outcome type, but all have much in common

Slide 2: Data Types


Slide 3: More on Data Type

  • Data type implies a plausible probability distribution
  • Different data types call for different summaries: the sample mean is not always interpretable
  • Distinctions between different types can be flexible

Slide 4: Model depends on outcome type

  • Continuous -- linear or gamma model
  • Discrete (counts) -- Poisson or negative binomial model
  • Binary -- logistic, relative risk models, survival models when follow-up varies
  • All easily implemented in Stata
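
As a rough sketch of how the outcome types above map to Stata commands, the lines below fit one model per type. The bundled auto dataset is an assumption used only as a stand-in (it is not the course data), and the choice of outcome and predictor variables is purely mechanical, to show syntax rather than a sensible analysis. Negative binomial models (nbreg) and survival models (stset followed by stcox) follow the same pattern but are not run here.

```stata
* Stand-in data shipped with Stata (assumption: not the course data)
sysuse auto, clear

* Continuous outcome: linear model
regress price mpg weight

* Continuous, positive, right-skewed outcome: gamma model with a log link
glm price mpg weight, family(gamma) link(log)

* Count outcome: Poisson model (rep78 is used only to show the syntax)
poisson rep78 mpg

* Binary outcome: logistic model, reported as odds ratios
logistic foreign mpg weight
```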

Slide 5: Data Exploration

  • Find data errors
  • Assess missingness
  • Detect anomalous observations and outlying data values
  • Select appropriate analysis methods
  • Support a formal data analysis
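
A minimal sketch of a first pass at these checks in Stata, assuming the bundled auto dataset as a stand-in for real study data. The variable rep78 is chosen because it happens to contain missing values.

```stata
* Stand-in data shipped with Stata (assumption: not the course data)
sysuse auto, clear

* Ranges and counts: implausible minima or maxima often flag data errors
summarize price mpg rep78

* Per-variable detail: type, range, number of unique and missing values
codebook rep78

* Report of every variable that has missing values
misstable summarize

* Count, then list, the observations with a missing value
count if missing(rep78)
list make rep78 if missing(rep78)
```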

Slide 6: Data Example

  • Western Collaborative Group Study
  • Large early observational study (n=3154)
  • Association between "type A" behavior and coronary heart disease (CHD)
  • Example variable: systolic blood pressure
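
The WCGS data are not bundled with Stata, so the sketch below assumes Stata's fictional blood-pressure dataset (bpwide) and its variable bp_before as a stand-in for systolic blood pressure. It shows one way to get the detailed descriptive output for a single numerical variable.

```stata
* Stand-in data: fictional blood-pressure dataset shipped with Stata
sysuse bpwide, clear

* Mean, SD, percentiles, extreme values, skewness, and kurtosis
summarize bp_before, detail

* Stored results can be reused, for example to report the median and IQR
display "median = " r(p50) "   IQR = " r(p75) - r(p25)
```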

Slide 7: Descriptive Output


Slide 8: Cholesterol Data with Outlier


Slide 9: Histograms


  • Shows location, spread, and shape of the distribution
  • Horizontal axis: intervals or "bins" in which data values are grouped
  • Vertical axis: number, fraction, or percent of the observations in each bin

Interpreting Histograms

  • Pattern of bar heights conveys shape of distribution:
    • number of modes
    • skewness
    • long or short tails
  • Usefulness depends on number of bins
    • too many defeats goal of summarization
    • too few obscures shape of distribution



Slide 10: Stata Commands

  • histogram varname: graphs a histogram
  • histogram varname, bin(x): histogram with x bars
  • histogram varname, freq: histogram showing frequencies (counts) rather than the default density scale
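
The commands above can be tried on any numerical variable. The sketch below assumes the bundled bpwide dataset and its variable bp_before as a stand-in; the graph names (h_default, h_few, h_many, h_freq) are hypothetical labels chosen so that each graph stays in memory for comparison.

```stata
* Stand-in data: fictional blood-pressure dataset shipped with Stata
sysuse bpwide, clear

* Default number of bins
histogram bp_before, name(h_default, replace)

* Few bins can obscure the shape; many bins defeat summarization
histogram bp_before, bin(5)  name(h_few, replace)
histogram bp_before, bin(40) name(h_many, replace)

* Counts on the vertical axis
histogram bp_before, bin(10) frequency name(h_freq, replace)
```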

Slide 11: Boxplot

  • Box with whiskers that extend no farther than the lower and upper fences
  • Box: 25th percentile, median, 75th percentile
  • Length of box: interquartile range (IQR)
  • Lower fence: 25th percentile minus 1.5*IQR
  • Upper fence: 75th percentile plus 1.5*IQR
  • Values outside the fences: plotted individually as outliers
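
The 1.5*IQR cutoffs described above can also be computed by hand, which helps when checking which observations a boxplot marks as outliers. This is a sketch under assumptions: the bundled auto dataset stands in for real data, and the scalar names (iqr_price, cut_lo, cut_hi) are hypothetical.

```stata
* Stand-in data shipped with Stata (assumption: not the course data)
sysuse auto, clear

* Quartiles are stored in r(p25) and r(p75) after summarize, detail
quietly summarize price, detail
scalar iqr_price = r(p75) - r(p25)
scalar cut_lo    = r(p25) - 1.5*iqr_price
scalar cut_hi    = r(p75) + 1.5*iqr_price
display "IQR = " iqr_price "  lower = " cut_lo "  upper = " cut_hi

* Count values beyond the cutoffs; missing values are excluded explicitly
* because Stata treats missing as larger than any number
count if (price < scalar(cut_lo) | price > scalar(cut_hi)) & !missing(price)
```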


Slide 12: Using a Boxplot

  • Location: given by lines in box, median
  • Spread: given by size of box, IQR
  • Skewness: relative distances between the median line and the two ends of the box
  • Outliers are clearly marked; you can usually tell how many there are and their values

Slide 13: Stata Command

  • graph box varname: graphs a boxplot
  • graph box varname, over(grpvar): side-by-side boxplots based on grpvar
  • graph box varname1 varname2, over(grpvar): side-by-side boxplots for two variables
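
A short sketch of the three boxplot forms, assuming the bundled bpwide dataset as a stand-in: bp_before and bp_after play the role of two numerical variables and sex is the grouping variable. The graph names are hypothetical labels.

```stata
* Stand-in data: fictional blood-pressure dataset shipped with Stata
sysuse bpwide, clear

* Single boxplot
graph box bp_before, name(box_one, replace)

* Side-by-side boxplots, one per level of the grouping variable
graph box bp_before, over(sex) name(box_by_sex, replace)

* Two variables, side by side within each group
graph box bp_before bp_after, over(sex) name(box_two, replace)
```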

Slide 14: qq Normal Plot

  • Graphical approach to assessing Normality
  • Vertical axis (y axis): sorted data values
  • Horizontal axis (x axis): expected data values if the data were Normal
  • If the plot is straight, the data are nearly Normal
  • Shape indicates the nature of the violation, if any


Slide 15: Using qqNormal Plot

  • Right skew: plot curved up
  • Left skew: plot curved down
  • Outlier: values far off line
  • Stata command: qnorm varname
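
To see the patterns above, it can help to compare a skewed variable with simulated Normal data. The sketch below is illustrative only: it assumes the bundled auto dataset, whose price variable is right skewed, and the variable z and the seed value are hypothetical choices.

```stata
* Stand-in data shipped with Stata (assumption: not the course data)
sysuse auto, clear

* Right-skewed variable: the points should bend upward away from the line
qnorm price, name(qq_price, replace)

* Simulated Normal data for reference: the points should lie near the line
set seed 12345
generate z = rnormal()
qnorm z, name(qq_normal, replace)
```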


Slide 16: Transforming variables


  • Make outcome more normally distributed
  • Linearize predictor effects, remove interactions, equalize outcome variance


  • Untransformed variable is more credible and interpretable
  • Natural scale may be more meaningful: cost vs log cost
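
The cost versus log cost trade-off can be explored directly. In the sketch below, the bundled auto dataset is an assumption, with price standing in for a right-skewed cost variable; lnprice is a hypothetical variable name.

```stata
* Stand-in data shipped with Stata (assumption: not the course data)
sysuse auto, clear

* Natural-log transformation of a positive, right-skewed variable
generate lnprice = ln(price)

* Compare skewness and percentiles on the two scales
summarize price lnprice, detail

* Compare shape before and after transforming
histogram price,   name(h_raw, replace)
histogram lnprice, name(h_log, replace)
qnorm lnprice, name(qq_log, replace)
```

Results on the log scale describe relative (multiplicative) differences, which is part of why the natural scale may be easier to interpret.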


Slide 17: Frequency Tables

  • Used for categorical data: loses no information
  • Display raw numbers or percentages
  • Can be used for continuous data
    • discards lots of information
    • may create relevant groups
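
A brief sketch of both uses, assuming the bundled bpwide dataset as a stand-in. The 140 cutoff and the variable name bp140 are illustrative assumptions, not values taken from the slides.

```stata
* Stand-in data: fictional blood-pressure dataset shipped with Stata
sysuse bpwide, clear

* One-way table for a categorical variable: counts and percentages
tabulate agegrp

* Two-way table with column percentages
tabulate agegrp sex, column

* Grouping a continuous variable: indicator for values of 140 or more,
* left missing wherever the original value is missing
generate byte bp140 = bp_before >= 140 if !missing(bp_before)
tabulate bp140 sex, column
```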


Slide 18: Summary

  • Types of Data: Numerical v. Categorical
  • Numerical: mean, SD, five-number summary
  • Numerical: histogram, boxplot, qq normal
  • Categorical: Tables
  • Transformations: potentially useful


Title UCSF - Data Exploration
Contributor/Contact David Glidden, PhD
Institution UCSF
Acknowledgment Please cite the appropriate contributors/authors/contacts when using or adapting these materials.
Format PDF slides
Attachment Glidden Data Exploration

Type of Course

Level of Course Beginning
Audience Graduate Student, Clinical Researcher
Topics Description Concepts in data exploration with Stata examples.
Software Program Stata


Keywords Multiple predictor regression
Data Types: Numerical (Continuous and Discrete), Categorical (Ordinal and Nominal)
Hierarchy of data types
See Also

Type of Activity Course Slides
Disclaimer The views expressed within CTSpedia are those of the author and must not be taken to represent policy or guidance on the behalf of any organization or institution with which the author is affiliated.
Topic revision: r5 - 08 Oct 2012 - 15:41:08 - MaryBanach

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.