This is the pollution data so loved by writers of papers on ridge regression.

Source: McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol.15, 463-481.

Variables in order:

PREC   Average annual precipitation in inches
JANT   Average January temperature in degrees F
JULT   Same for July
OVR65  % of 1960 SMSA population aged 65 or older
POPN   Average household size
EDUC   Median school years completed by those over 22 (over 25 in 1960?)
HOUS   % of housing units which are sound & with all facilities
DENS   Population per sq. mile in urbanized areas, 1960
NONW   % non-white population in urbanized areas, 1960
WWDRK  % employed in white collar occupations, 1960
POOR   % of families with income < $3000, 1960
HC     Relative hydrocarbon pollution potential
NOX    Same for nitric oxides
SO2    Same for sulphur dioxide
HUMID  Annual average % relative humidity at 1pm
MORT   Total age-adjusted mortality rate per 100,000

Data pooled from a variety of sources.

Summary of data from:

Instabilities of Regression Estimates Relating Air Pollution to Mortality
Gary C. McDonald and Richard C. Schwing
Technometrics, Vol. 15, No. 3 (Aug., 1973), pp. 463-481

"The total age adjusted mortality rate, our response variable in each regression equation, can be obtained for the years 1959-1961 for 201 Standard Metropolitan Statistical Areas (SMSA) from Duffy and Carroll [4]. In Table 5 of [4], the age-adjusted death rates are given for the categories male white, female white, male non-white and female non-white. In addition, the number of deaths in each of these four categories is also provided. We define our total age adjusted mortality rate to be

MR = (\sum D_i) (\sum (D_i/R_i))^{-1},

where D_i and R_i are the deaths and age adjusted death rates of, say, the ith category respectively, i = 1, 2, 3, 4. The sums are then taken over the four categories. ...

The pollution potential of three pollutants, namely HC, NO_x, SO2, have been estimated by Benedict [1]. The pollution potential is determined as the product of the tons emitted per day per square kilometer of each pollutant and a dispersion factor which accounts for mixing height, wind speed, number of episode days and dimension of each SMSA. Since each SMSA has the same dispersion factor for each pollutant, this quantity is "confounded" with each pollution potential term. Benedict's pollution potentials are available for sixty SMSA's, for the year 1963, which are geographically consistent with the available mortality data. Note, however, that the time period for which the pollution potentials apply (1963) is slightly later than the time period applicable to the mortality data. Though the pollution variables are labeled HC, NO_x, and SO2, there are other variables, especially other pollutants, which are highly correlated with each of these indices. For example, SO2 is highly correlated with certain types of particulates and HC is closely tied to carbon monoxide and lead salts. Thus one cannot demonstrate a specific cause and effect even though the analysis quantifies the relationship."
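
The mortality rate defined in the quoted passage, MR = (\sum D_i) (\sum (D_i/R_i))^{-1}, amounts to a harmonic mean of the four category rates R_i weighted by the deaths D_i: each D_i/R_i is proportional to the population at risk in that category, so MR is total deaths over total implied population. The following is only a minimal Python sketch of that arithmetic. The function name and the example numbers are hypothetical and are not taken from Duffy and Carroll [4] or from this data set.

```python
from typing import Sequence


def total_age_adjusted_rate(deaths: Sequence[float], rates: Sequence[float]) -> float:
    """Combine category death rates as MR = (sum D_i) / (sum D_i / R_i).

    deaths: number of deaths D_i in each category
    rates:  age-adjusted death rate R_i in each category (same units as MR)
    """
    if len(deaths) != len(rates):
        raise ValueError("deaths and rates must have the same length")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive")
    # D_i / R_i is proportional to the population at risk in category i.
    implied_population = sum(d / r for d, r in zip(deaths, rates))
    if implied_population == 0:
        raise ValueError("at least one category must have deaths")
    return sum(deaths) / implied_population


# Hypothetical figures for the four categories, in the order:
# male white, female white, male non-white, female non-white.
example_deaths = [5000, 4000, 800, 700]
example_rates = [1000.0, 800.0, 1600.0, 1400.0]  # per 100,000

mr = total_age_adjusted_rate(example_deaths, example_rates)
print(f"{mr:.1f}")  # prints 954.5
```

Because the weights are the deaths, MR always lies between the smallest and largest category rate, and it reduces to the common rate when all four categories share the same R_i.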