
# Title: UCSF - Data Exploration

## Slide 1: Multiple Predictor Regression

• Assess the relationship between an outcome and multiple predictors
• Powerful tool for:
  • understanding complex relationships
  • controlling confounding
  • prediction / risk stratification
• Regression models differ by outcome type, but all have much in common

## Slide 2: Data Types

## Slide 3: More on Data Type

• Data type implies a plausible probability distribution
• Different data types call for different summaries: the sample mean is not always interpretable
• Distinctions between different types can be flexible

## Slide 4: Model depends on outcome type

• Continuous: linear or gamma model
• Discrete (counts): Poisson or negative binomial model
• Binary: logistic or relative risk models; survival models when follow-up varies
• All easily implemented in Stata
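
As a rough sketch of how the outcome type maps to a Stata command, the example below uses Stata's built-in `auto` dataset as a stand-in for real study data. The variable choices are illustrative assumptions, not from the slides. In particular, `rep78` is a 1-5 rating used only to show the count-model syntax. The relative risk model shown (a log-link Poisson GLM with robust standard errors) is one common approach and not necessarily the one the course uses.

```stata
* Illustrative only: the auto data stand in for a real study dataset.
sysuse auto, clear

* Continuous outcome: linear regression, or a gamma GLM for a right-skewed outcome
regress price mpg weight
glm price mpg weight, family(gamma) link(log)

* Count outcome: Poisson regression
* (rep78 is a 1-5 rating, used here only as a stand-in for a count;
*  nbreg takes the same syntax for a negative binomial model)
poisson rep78 mpg

* Binary outcome: logistic regression (odds ratios)
logistic foreign mpg

* Binary outcome: one way to estimate relative risks is a log-link
* Poisson GLM with robust standard errors (an assumption, see lead-in)
glm foreign mpg, family(poisson) link(log) vce(robust) eform

* Survival outcomes need time-to-event data: stset, then e.g. stcox (not shown)
```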

## Slide 5: Data Exploration

• Find data errors
• Assess missingness
• Detect anomalous observations and outlying data values
• Select appropriate analysis methods
• Support a formal data analysis
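
A first exploratory pass along these lines might look like the sketch below. It again uses the built-in `auto` dataset as a placeholder, since the slides do not give the study's variable names; substitute your own variables.

```stata
* Illustrative only: the auto data stand in for a real study dataset.
sysuse auto, clear

* Overview of variables, storage types, and labels
describe

* Ranges, number of distinct values, and observation counts per variable
codebook, compact

* Assess missingness: which variables have missing values, and how many
misstable summarize

* Detailed summary: percentiles, smallest and largest values, skewness
summarize price, detail

* Look at the most extreme observations directly
gsort -price
list make price in 1/5
```

Implausible minimum or maximum values in the `summarize` or `codebook` output are often the quickest way to spot data entry errors.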

## Slide 6: Data Example

• Western Collaborative Group Study
• Large early observational study (n=3154)
• Association between "type A" behavior and coronary heart disease (CHD)
• Example variable: systolic blood pressure

## Slide 7: Descriptive Output

## Slide 8: Cholesterol Data with Outlier

## Slide 9: Histograms

• Shows location, spread, and shape of the distribution
• Horizontal axis: intervals or "bins" in which data values are grouped
• Vertical axis: number, fraction, or percent of the observations in each bin

## Interpreting Histograms

• Pattern of bar heights conveys shape of distribution:
  • number of modes
  • skewness
  • long or short tails
• Usefulness depends on number of bins
  • too many defeats goal of summarization
  • too few obscures shape of distribution

## Slide 10: Stata Commands

• histogram varname: graph a histogram
• histogram varname, bin(x): histogram with x bins
• histogram varname, freq: histogram showing frequencies (counts) rather than the default density scale
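
The commands above can be tried on Stata's built-in `auto` dataset, used here as an assumed stand-in for the study data. Varying `bin()` illustrates how too few or too many bins changes what the histogram reveals.

```stata
* Illustrative only: the auto data stand in for a real study dataset.
sysuse auto, clear

* Default histogram: Stata chooses the number of bins
histogram price

* Ten bins, with counts on the vertical axis
histogram price, bin(10) frequency

* Too few bins hides the shape; too many gives a noisy picture
histogram price, bin(3) frequency
histogram price, bin(40) frequency

* Percent of observations per bin, with a normal density overlaid
histogram price, bin(10) percent normal
```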

## Slide 11: Boxplot

• Box with whiskers and upper & lower fences
• Box: 25th percentile, median, 75th percentile
• Length of box: interquartile range (IQR)
• Lower fence: 25th percentile minus 1.5*IQR
• Upper fence: 75th percentile plus 1.5*IQR
• Values outside the fences: outliers

## Slide 12: Using a Boxplot

• Location: given by the median line inside the box
• Spread: given by the size of the box (the IQR)
• Skewness: given by the relative distances between the median line and the two ends of the box
• Outliers are clearly marked; you can usually tell how many there are and their values

## Slide 13: Stata Command

• graph box varname: graph a boxplot
• graph box varname, over(grpvar): side-by-side boxplots based on grpvar
• graph box varname1 varname2, over(grpvar): side-by-side boxplots for two variables
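
To connect the plot to the 1.5*IQR rule, the sketch below draws the boxplots and then computes the limits by hand from the quartiles. It uses the built-in `auto` dataset as an assumed stand-in, and the scalar names (`iqr`, `lo_limit`, `hi_limit`) are made up for this illustration. The hand-computed limits use the percentiles from `summarize, detail`, which may differ slightly from what other software reports.

```stata
* Illustrative only: the auto data stand in for a real study dataset.
sysuse auto, clear

* Boxplot of one variable, then side by side by group
graph box price
graph box price, over(foreign)

* Compute the 1.5*IQR limits by hand from the quartiles
summarize price, detail
scalar iqr      = r(p75) - r(p25)
scalar lo_limit = r(p25) - 1.5*iqr
scalar hi_limit = r(p75) + 1.5*iqr
display "IQR = " iqr ", limits = (" lo_limit ", " hi_limit ")"

* Count and list the observations the rule flags as outliers
count if (price < lo_limit | price > hi_limit) & !missing(price)
list make price if (price < lo_limit | price > hi_limit) & !missing(price)
```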

## Slide 14: qq Normal Plot

• Graphical approach to assessing Normality
• Vertical axis (y axis): sorted data values
• Horizontal axis (x axis): expected data values if the data were Normal
• If the plot is straight, the data are nearly Normal
• Shape indicates the nature of the violation, if any

## Slide 15: Using qqNormal Plot

• Right skew: plot curved up
• Left skew: plot curved down
• Outlier: values far off line
• Stata: qnorm varname

## Slide 16: Transforming variables

Rationale:

• Make outcome more normally distributed
• Linearize predictor effects, remove interactions, equalize outcome variance
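
As one illustration of the first point, the sketch below log-transforms a right-skewed variable and compares Normal quantile plots and skewness before and after. The built-in `auto` dataset and the new variable name `log_price` are assumptions for the example. The log is only one of several possible transformations.

```stata
* Illustrative only: the auto data stand in for a real study dataset.
sysuse auto, clear

* Normal quantile plot of the raw variable
qnorm price

* Log-transform and plot again for comparison
generate log_price = ln(price)
label variable log_price "Log of price"
qnorm log_price

* Compare skewness before and after the transformation
summarize price, detail
display "skewness, raw scale: " r(skewness)
summarize log_price, detail
display "skewness, log scale: " r(skewness)
```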

Drawbacks:

• Untransformed variable more credible, interpretable
• Natural scale may be more meaningful: cost vs log cost

## Slide 17: Frequency Tables

• Used for categorical data; loses no information
• Display raw numbers or percentages
• Can be used for continuous data
  • may create relevant groups

## Slide 18: Summary

• Types of Data: Numerical vs. Categorical
• Numerical: mean, SD, five-number summary
• Numerical: histogram, boxplot, qq Normal plot
• Categorical: Tables
• Transformations: potentially useful
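
For the categorical summaries recapped above, a minimal sketch of frequency tables in Stata follows, including grouping a continuous variable before tabulating it. It uses the built-in `auto` dataset as an assumed stand-in. The cutpoints and the name `mpg_group` are arbitrary choices for illustration.

```stata
* Illustrative only: the auto data stand in for a real study dataset.
sysuse auto, clear

* One-way frequency table with counts and percentages
tabulate rep78

* Show missing values as their own category
tabulate rep78, missing

* Two-way table with row percentages
tabulate rep78 foreign, row

* Group a continuous variable into categories, then tabulate
* (cutpoints 0, 20, 30, 50 are arbitrary and cover the observed range)
egen mpg_group = cut(mpg), at(0 20 30 50) label
tabulate mpg_group
```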

### EducationalMaterialsForm

• Title: UCSF - Data Exploration
• Contributor/Contact: David Glidden, PhD
• Institution: UCSF
• Acknowledgment: Please cite the appropriate contributors/authors/contacts when using or adapting these materials.
• Format: PDF slides
• Attachment: Glidden Data Exploration
• Level of Course: Beginning
• Audience: Graduate Student, Clinical Researcher
• Description: Concepts in data exploration with Stata examples.
• Software Program: Stata
• Keywords: Multiple predictor regression; Data Types: Numerical (Continuous and Discrete), Categorical (Ordinal and Nominal); Hierarchy of data types; Outliers; Boxplots; Histograms
• Type of Activity: Course Slides
• Disclaimer: The views expressed within CTSpedia are those of the author and must not be taken to represent policy or guidance on the behalf of any organization or institution with which the author is affiliated.

Topic revision: r5 - 08 Oct 2012 - 15:41:08 - MaryBanach

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.