As time allows, I post things on this page that might be useful to those who, like me, are knee deep in the blood, sweat, and tears of dissertation and other research. Check frequently by scrolling down. I usually add posts at the bottom when the spirit moves me.
(This page is best viewed on a desktop or laptop computer.)
When a data file includes a grouping variable, it is possible to use the SPSS Split File command (Data > Split File) to tell SPSS to rerun the same analysis on every group that is defined by the grouping variable. For instance, if you have data on job satisfaction for 400 employees in 20 work teams, you can tell SPSS to split the file by work teams and run the same analysis over and over for each of those 20 work teams. However, SPSS has different rules about how many categories your grouping variable can contain depending on whether it's a string variable or a numeric variable. There is no limit to the number of categories your grouping variable can contain if the groups are coded with numbers and the variable is defined as a numeric variable. In contrast, if the grouping variable is a string variable and the categories are coded with words or letters, SPSS limits the Split File command to only eight categories. Another great reason to avoid string variables in SPSS!
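For readers who also work outside SPSS, the same idea is easy to sketch in Python. This is just an illustrative analogue of Split File with made-up team names and satisfaction scores: group the cases by the grouping variable, then rerun the same analysis (here, a mean) on every group.

```python
# Illustrative Python analogue of SPSS Split File (hypothetical data):
# rerun the same analysis (here, a mean) once per group.
from statistics import mean

# job satisfaction scores keyed by work team (made-up numbers)
records = [
    ("TeamA", 4.0), ("TeamA", 3.5), ("TeamB", 2.0),
    ("TeamB", 3.0), ("TeamC", 5.0), ("TeamC", 4.5),
]

# group the cases by the grouping variable...
groups = {}
for team, score in records:
    groups.setdefault(team, []).append(score)

# ...then run the same analysis on every group
group_means = {team: mean(scores) for team, scores in groups.items()}
print(group_means)
```

Note that the Python dictionary happily holds any number of groups, numeric or string; it's SPSS that imposes the eight-category limit on string grouping variables.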
Every so often changes are made in SPSS syntax that can catch one by surprise. For instance, a couple versions back, one could write a simple line of syntax to specify that analyses beyond that point would run on only a select group of cases.
For instance, if you wanted to run an analysis on only cases whose age was less than 30, you could type this line in a syntax file and run it:
SELECT IF (AGE LT 30).
That line of syntax would simply tag unselected cases (i.e., cases who were 30 and older) and not include them in the analysis, but the cases still remained in the data file.
At some point in the recent past, however, the rules changed! Running that same line now doesn't just exclude the cases from the analysis, it DELETES THEM FROM THE DATA FILE!!
The syntax required to select a subset of cases for analysis while leaving the other cases in the data file is a lot more complex. For the example begun above, where you want to select for subsequent analysis only those cases who are less than 30 years old:
COMPUTE filter_$=(AGE LT 30).
VARIABLE LABELS filter_$ 'AGE LT 30 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
Yikes! I recommend that you use the menu system to generate the necessary syntax and then paste it into your syntax file in the appropriate location:
Data > Select Cases > Check "If condition is satisfied" > Click the "If..." button > type in "age lt 30" (without the quotes) > Click the "Continue" button > Click the "OK" button.
One often needs to create new variables by combining old variables. For instance, maybe you've got a 5-item inventory and need to calculate a total score, either in the form of a sum of the responses to the items or perhaps by averaging the responses to the items. This is done easily in SPSS: Transform > Compute Variable, then name the new variable in the "Target Variable" field and type in how that new variable is to be created in the "Numeric Expression" field.
Did you know that in doing this, two approaches that you'd think are equivalent are actually not equivalent?
Look at this example. These two methods of summing responses to five items to get a scale total score are NOT EQUIVALENT:
ScaleTotal = Item1+Item2+Item3+Item4+Item5
is not equivalent to
ScaleTotal = SUM(Item1 TO Item5)
Similarly, these two methods of averaging responses to five items are NOT EQUIVALENT:
ScaleAverage = (Item1+Item2+Item3+Item4+Item5)/5
is not equivalent to
ScaleAverage = MEAN(Item1 TO Item5)
If you have any missing data on the variables involved in the expression, functions like SUM and MEAN will calculate sums and means USING WHATEVER DATA ARE AVAILABLE, and never mention this to you! On the other hand, if the calculations are written out fully by hand, missing data will result in no value being calculated. Depending on the circumstances, one of these outcomes may be preferable to the other, so it's important to be careful that you're getting what you want.
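The distinction can be sketched in a few lines of Python, with None standing in for an SPSS missing value (the item responses are made up):

```python
# Sketch of the two behaviors, with None standing in for a missing value.
def strict_sum(values):
    # written-out arithmetic: any missing value makes the result missing
    return None if any(v is None for v in values) else sum(values)

def available_sum(values):
    # SUM-style behavior: quietly uses whatever data are available
    return sum(v for v in values if v is not None)

items = [4, 5, None, 3, 4]    # one missing item response
print(strict_sum(items))       # None: no total is calculated
print(available_sum(items))    # 16: total from the four available items
```

With complete data the two approaches agree; they diverge only when something is missing, which is exactly when you most need to know which one you're getting.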
It's often recommended that cases with excessive amounts of missing data should be eliminated from the analysis. Assuming that you've decided what constitutes "excessive" missing data, how can you easily identify cases that exceed this value? SPSS makes it easy, if you know the trick. Here it is. Write syntax as follows:
COUNT CMISS = var1 var2 var3 ... varx (MISSING).
(except use the names of the variables you're interested in in place of var1 ... varx). This will add a variable to your data file called CMISS, which lists for each case how many values are missing in the series of variables var1 to varx. You can then easily identify cases with more missing values than you're willing to tolerate.
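The same trick is a one-liner per case in Python; this sketch (with made-up case IDs and data, None standing in for missing) builds a CMISS-style count and then flags cases over a tolerance:

```python
# Sketch of COUNT ... (MISSING) in Python: count missing values (None)
# per case across a set of variables (made-up data).
cases = {
    "case1": [1, 2, None, 4],
    "case2": [None, None, 3, None],
    "case3": [1, 2, 3, 4],
}

cmiss = {cid: sum(v is None for v in vals) for cid, vals in cases.items()}
print(cmiss)  # {'case1': 1, 'case2': 3, 'case3': 0}

# flag cases exceeding a tolerance of, say, 2 missing values
excessive = [cid for cid, n in cmiss.items() if n > 2]
print(excessive)  # ['case2']
```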
I'm working on a project that requires measuring several aspects of diversity in corporate boards of directors. The measure of diversity I chose to use with both dichotomously-scored (i.e., 0/1) nominal scale variables and continuous variables was the standard deviation. I aggregated the data from the board members associated with each firm and calculated the standard deviations for each diversity variable. For instance, the standard deviation of the ages of the board members served as a measure of age diversity, the standard deviation of the gender variable (0 = female, 1 = male) served as a measure of gender diversity, and so on.
The problem I found is that SPSS calculates Bessel's corrected standard deviation (with N-1 in the denominator), not the actual standard deviation (with N in the denominator). Consequently, the Bessel-corrected standard deviations reflected not only diversity, but also sample size. Take two companies, A and B. Company A has 1 male and 1 female. Company B has 2 males and 2 females. Both companies show equal gender diversity, but Bessel's corrected standard deviation as computed by SPSS is 0.71 for Company A and 0.58 for Company B, and both values are higher than the theoretical upper limit for the standard deviation for binary data, which is .50! What's a boy to do?! Un-Bessel the standard deviations (called "SD" below) as follows:
Transform > Compute Variable
Target Variable: New_Std_Dev
Numeric Expression: ((SD**2)/(N/(N-1)))**.5
In words: (1) Square the Bessel-corrected standard deviation (SD). (2) Divide #1 by N/(N-1). (3) Find the square root of #2. Having un-Besseled in this way, both Company A and Company B now show a standard deviation of .50.
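Here's a quick Python sketch of the un-Besseling, reproducing the two hypothetical companies from above (Python's statistics.stdev, like SPSS, uses the N-1 denominator):

```python
# Un-Bessel a standard deviation: sqrt(SD^2 / (N / (N - 1))).
import math
from statistics import stdev

def unbessel(sd, n):
    # reverse Bessel's correction to recover the N-denominator SD
    return math.sqrt(sd ** 2 / (n / (n - 1)))

# Company A: 1 male, 1 female; Company B: 2 males, 2 females
company_a = [0, 1]
company_b = [0, 0, 1, 1]

sd_a = stdev(company_a)  # about 0.71, Bessel-corrected
sd_b = stdev(company_b)  # about 0.58, Bessel-corrected

print(unbessel(sd_a, len(company_a)))  # 0.5
print(unbessel(sd_b, len(company_b)))  # 0.5
```

Both companies now land exactly on .50, the population standard deviation of a 50/50 binary split, regardless of board size.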
Have you ever needed a measure of variability for a nominal scale variable? I did recently, as a way of measuring the diversity of corporate boards of directors. (I’ll give examples later.) Statistics texts focus on measuring variability of continuous variables (e.g., range, inter-quartile range, variance, standard deviation), but are remarkably silent on the matter of measuring variability in categorical variables. Kader, G. D., & Perry, M. (2007), "Variability for categorical variables," Journal of Statistics Education, 15(2), 1-16, provide one measure, however, with the unlikely name “coefficient of unalikeability,” abbreviated with the symbol u2. The logic behind the unalikeability statistic takes them 16 pages to explain, but we can cut to the computational formula and see how it works with some examples.
Here are some data to illustrate the unalikeability statistic. The variable is Religious Preference, with four categories: Christian, Jewish, Muslim, Hindu. Shown next are frequency distributions for two samples, one with no variability on the variable, and the second with maximum possible variability.
Sample 1: shows no variability or diversity in religious preference
Christian: 8, Jewish: 0, Muslim: 0, Hindu: 0 (N = 8)
Sample 2: shows maximum possible variability or diversity as cases are evenly distributed across categories of the variable
Christian: 2, Jewish: 2, Muslim: 2, Hindu: 2 (N = 8)
Next, here’s the formula for the unalikeability statistic:
u2 = 1 - Sum of the squared proportions
In words: (1) square the proportions associated with each category of the variable; (2) add these squared proportions; (3) subtract that sum from 1.
For the first distribution: u2 = 1 – (1^2 + 0^2 + 0^2 + 0^2) = 0
For the second distribution: u2 = 1 – (.25^2 + .25^2 + .25^2 + .25^2) = .75
You can see that where there is no variability or diversity, u2 takes on a value of 0. With a more variable or diverse distribution, the value of u2 increases appropriately to reflect that greater variability.
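The computation is simple enough to sketch in a few lines of Python; this reproduces both samples above from their category frequencies:

```python
# Coefficient of unalikeability: u2 = 1 - (sum of squared proportions).
def unalikeability(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Sample 1: all 8 cases in one category -> no diversity
print(unalikeability([8, 0, 0, 0]))  # 0.0

# Sample 2: cases evenly spread across four categories -> maximum diversity
print(unalikeability([2, 2, 2, 2]))  # 0.75
```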
The only thing that’s annoying about the unalikeability statistic is that the maximum value of u2 is different depending on the number of categories. For a two-category variable (like sex, for instance), the highest possible value is u2 = 0.50. For a three-category variable, maximum u2 = 0.67. For a four-category variable, maximum u2 = .75. What this means is that one can’t use u2 to compare levels of variability in categorical variables that contain different numbers of categories. However, u2 does at least allow us to measure differences from one group to the next on the same categorical variable.
I suppose that this limitation of the unalikeability statistic really isn’t that limiting. After all, we can’t compare the variances or standard deviations of two different variables either, because those measures of variability are influenced not only by actual data variability but also by score magnitude.
I often get data files from clients who have not discovered yet just how much SPSS hates words! As a consequence, a variable like Gender will be treated as a string variable and is coded “male” or “female” instead of as a numeric variable with numerical codes representing the categories like 1 = “male” and 0 = “female.” In order to do any serious statistical work with these string variables, they must be recoded as numeric variables. Writing a short SPSS syntax file is the simplest way of getting this done.
Using the example above where men are coded as “male” and women are coded as “female”:
File > New > Syntax will open a blank syntax file.
Type the following commands:
recode Gender (‘male’ = 1) (‘female’ = 0) into nGender.
Then click on Run All. A new variable will be created in your data file called nGender, and cases that are described as “male” will be coded 1 in nGender, and cases that are described as “female” will be coded 0 in nGender.
There’s just one caveat, and I was frustrated beyond belief until I finally figured this out (and of course nobody thinks to mention it!!). Although SPSS is generally not case-sensitive software, it is when it comes to dealing with string variables! So guess what would happen if you wrote and ran the following syntax:
recode Gender (‘Male’ = 1) (‘Female’ = 0) into nGender.
NOTHING! Nothing gets recoded at all, because SPSS is looking for instances of “Male” to recode as 1’s and it’s only finding “male.” And it’s looking for instances of “Female” to recode as 0’s and it’s only finding “female.”
The same problem with case sensitivity pops up when using the SPSS “Select Cases” command to select a subset of cases for an analysis. Returning to the example above, if you ran a Select Cases If: Gender = ‘Male’ you wouldn’t select anybody, because in the data file the men are identified as "male," not "Male."
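The pitfall is easy to reproduce outside SPSS, too. In this Python sketch (made-up data), a dictionary lookup plays the role of the recode, and like SPSS string comparisons it is case-sensitive:

```python
# Sketch of the recode in Python; dict lookup is case-sensitive,
# just like SPSS string comparisons (made-up data).
gender = ["male", "female", "male"]

codes = {"male": 1, "female": 0}
ngender = [codes.get(g) for g in gender]
print(ngender)  # [1, 0, 1]

# a recode written with the wrong case finds nothing to recode
wrong_codes = {"Male": 1, "Female": 0}
print([wrong_codes.get(g) for g in gender])  # [None, None, None]
```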
One of my clients had to send me data in waves as it became available. The cases in each incoming wave were about the same, with a few new cases added and a few old cases missing, but each incoming data file included some new variables.
How does one merge SPSS data files like this? As long as all cases in all files have unique identifying code numbers, it's simple. Let's call the variable that contains the identifying code numbers CaseID.
Let's call the data file you are adding on to Data1 and the new file to be merged into Data1 will be called Data2.
1. Sort cases in both data files by CaseID as follows: Data > Sort Cases > move CaseID to "Sort by" window. Choose "Ascending" Sort Order > OK.
2. Save each file after sorting the cases by CaseID in ascending order.
3. Keep both files open, but activate Data1, i.e., have it showing on your screen.
4. Data > Merge Files > Add Variables > highlight Data2 > Continue.
5. Highlight CaseID in the "Excluded Variables" window > check "Match cases on key variables" > move CaseID to the "Key Variables" window by clicking on the arrow > OK.
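For the curious, the logic of a key-variable merge can be sketched in plain Python (the CaseIDs and variables here are made up): take the union of the cases in both files, then pull each case's variables from whichever files contain it.

```python
# Sketch of Add Variables with a key variable: combine two waves
# on CaseID, keeping all variables from both (made-up data).
data1 = {101: {"age": 25}, 102: {"age": 30}, 103: {"age": 41}}
data2 = {101: {"score": 7}, 103: {"score": 9}, 104: {"score": 5}}

merged = {}
for case_id in sorted(set(data1) | set(data2)):
    row = {}
    row.update(data1.get(case_id, {}))   # variables from the first wave
    row.update(data2.get(case_id, {}))   # variables from the second wave
    merged[case_id] = row

print(merged[101])  # {'age': 25, 'score': 7}
print(merged[104])  # {'score': 5}  (a case new in the second wave)
```

A case present in only one wave simply has the other wave's variables missing, which is exactly what SPSS does with unmatched cases.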
Do not wait to run your statistics until you've collected your data.
What?! How can you run the statistics before you've collected your data? Of course you have to have numbers in order to do any number crunching. What I meant is that it is terribly unwise to assume that you'll be able to figure out how to run your statistics after the data have all been collected.
At least a quarter of the projects I work on require inordinate amounts of data reconstruction, repair, and reformatting before any statistical analysis can begin, simply because the researchers jumped straight to data collection without thinking through how the data would be analyzed. Admittedly, thinking about collecting data is a lot more fun than running the statistics, but there's not much point in collecting data that can't be evaluated statistically.
Here's what I always recommend. Once you've gotten your research questions nailed down and you've thought through how you want to collect data, create a mock data file. In other words, make up some numbers that you think mimic the results you'll get when you collect the real data according to your plan. Then see if you can run the statistics you need on your mock data. You may discover that you've not formatted the data file correctly and need to fix that. You may discover that the analyses you want to run require one or more additional variables. You might find that you need to code responses in a more statistics-friendly fashion. You might find out that you aren't asking the right questions at all.
It's easy to make the necessary changes at that point because you haven't already started collecting any actual data.
It's a good idea to work with a statistical consultant in the planning stages of research to make sure that you're collecting data in a way that enables the best statistical analyses possible to address your research questions.
One common way of categorizing statistics involves putting some of them (like t-tests and ANOVAs) into the "significant difference tests" box, and others (like the Pearson correlation and Chi-square) into the "correlation" box. Although that can be useful, it can also be confusing, because significant difference tests can be used to establish correlational relationships between variables and correlations can be used to measure the magnitude of the difference between groups.
Suppose you've found that men and women differ significantly on a measure of political conservatism. In addition to noting that there is a significant sex difference, it would also be appropriate to say that there is a significant relationship (or correlation) between sex and political conservatism. In fact, the various measures of effect strength (e.g., Cohen's d and f statistics, the eta-squared statistic, and the omega-squared statistic) that are often calculated after finding a difference to be statistically significant are actually just measures of the strength of that correlational relationship.
Suppose that you've found that there is a significant correlation between years of education and income. It would also be extremely likely that if you used a median split to divide your sample into two income groups--one "low income" and the other "high income"--you would find that the two groups would now differ significantly if you used a t-test to compare their average educational levels.
So, although it's sometimes convenient to divide statistics into "difference tests" and "correlations," the fact is that the primary difference between the statistics in these categories is semantic, not mathematical.
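The equivalence is easy to verify numerically. This Python sketch (made-up scores) computes a pooled-variance t statistic for two groups, then shows that the point-biserial correlation between a 0/1 group code and the outcome equals t / sqrt(t^2 + df):

```python
# Difference test vs. correlation: the same relationship, two framings.
import math
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# two groups on some outcome (made-up scores)
group0 = [1.0, 2.0, 3.0]
group1 = [4.0, 5.0, 6.0]

# independent-samples t statistic (pooled variance)
n0, n1 = len(group0), len(group1)
m0, m1 = mean(group0), mean(group1)
ss0 = sum((v - m0) ** 2 for v in group0)
ss1 = sum((v - m1) ** 2 for v in group1)
df = n0 + n1 - 2
pooled_var = (ss0 + ss1) / df
t = (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# point-biserial correlation: code group membership 0/1 and correlate
dummy = [0] * n0 + [1] * n1
r_direct = pearson(dummy, group0 + group1)

# the same r recovered from the t statistic
r_from_t = t / math.sqrt(t ** 2 + df)
print(round(r_direct, 4), round(r_from_t, 4))
```

The two numbers printed are identical: the "difference test" and the "correlation" are the same analysis wearing different clothes.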
I work a lot with data that my clients have collected on their own, that is, without the benefit of guidance during the research design phase of the project. That’s great, and I’m happy to assist those who already have their data, but it does mean that sometimes the data don’t have all the qualities we might like.
Here’s an example. We often collect data on a variety of demographic variables, if only so that we can later describe the characteristics of our samples. Age is one of the demographic variables that we commonly collect, and there are good ways and bad ways of collecting this information. Too often, folks pursue one of these “bad” approaches.
Take this for example:
What is your age? (check one)
18 and younger _____
19-24 _____
25-30 _____
31-36 _____
37-42 _____
Suppose you get the following frequency counts in each category:
What is your age? (check one)
18 and younger f = 15
19-24 f = 11
25-30 f = 23
31-36 f = 19
37-42 f = 46
Now it’s time to do some simple descriptive statistics. What’s your sample’s mean age? What’s the standard deviation? Do you see the problem? Because you’ve collected age data in categories, you don’t really know what anyone’s exact age is. And because you don’t know the ages of your participants, you can’t calculate the mean, standard deviation, or any other statistic that involves the Age variable.
Some of the best advice I received as a grad student learning multivariate statistics was this: If there's any alternative to using multivariate analysis of variance (MANOVA), use it!
MANOVA is a widely used, but unwieldy statistical procedure. In comparisons of two or more groups that are defined by a single independent variable or factor (as in a one-way MANOVA), a significant multivariate effect will tell you that you have a reliable, replicable treatment effect, that is, at least the largest between-group difference is statistically significant. But different in what way? Ah, there's the rub! MANOVA creates a new dependent variate from the multiple dependent variables that were entered into the analysis, and the groups differ on this variate. But what does that variate measure? MANOVA doesn't tell you. It might be 2 parts of DV1, 4 parts of DV2, 3 parts of DV3, etc., but nothing in the output tells you. All you know, then, is that some unknown combination of your dependent variables has been created on which your groups differ significantly. Really, what point is there in knowing that the groups differ if you can't describe HOW they differ? That's when people resort to comparing the groups on each of the original dependent variables. But wasn't the point of the MANOVA to avoid those univariate tests in the first place?
And what about factorial MANOVA? Things get even more delicious here! In a two-factor MANOVA, there will be tests of two main effects and the interaction effect. Each of these tests uses a DIFFERENT VARIATE, i.e., a different combination of the original dependent variables. So the main effect of Factor A might be tested on a variate that consists of 2 parts DV1, 4 parts DV2, 8 parts DV3.... and the main effect of Factor B might be tested using a variate that is 1 part DV1, 9 parts DV2, 5 parts DV3.... and the interaction effect is probably tested using yet some other combination of the original dependent variables. Again, you'll know if your independent variables (factors) had a significant effect, but you won't be able to describe the nature of that effect!
What are the alternatives? Instead of the one-way MANOVA, try discriminant analysis. This procedure will tell you if there is a significant effect of a single independent variable, AND it'll tell you how the variate was constructed that produced this significant effect. So you'll know not only that the groups differ, you'll also know in what way the groups differ.
Unfortunately, there are no discriminant analysis alternatives to the within-subjects ("repeated measures") one-way MANOVA. Discriminant analysis is just a between-subjects test. Also, there is no factorial version of the discriminant analysis, so you can't use it to look for interaction effects. But if you want a multivariate form of one-way between-subjects ANOVA, try the discriminant analysis.
This time let's talk about the use of odd vs. even numbers of scale points in Likert-type rating scales. I just worked on another project in which the principal investigator had chosen to use an even number of scale points--1-4--in assessing opinions: 1 = strongly disagree; 2 = disagree; 3 = agree; and 4 = strongly agree. What's wrong with that? Where does it leave the poor soul who truly has no opinion on the issue or who is completely "on the fence?" The explanation given for using a scale with an even number of scale points is that it "forces" people to take a stand. Is that a reasonable thing to do? Isn't the purpose of the scale to find out what people think, not to force them to think one way or the other? If someone has no opinion on an issue or is really torn equally between the two alternatives, that's what a researcher should find out, but the 4-point scale doesn't allow that.
Opinions, and the rating scales that measure them, tend to be bipolar dimensions. That is, you can like the President's position on an issue or not like that position. Two poles, one indicated by low ratings (1 or 2), the other indicated by high ratings (3 or 4). Sometimes, though, we use rating scales to measure unipolar dimensions, i.e., the degree to which something is present or is the case. For instance, I might ask you to rate the degree to which your first statistics class was a rewarding experience from 1 = it was a train wreck! to 5 = it was magical! In a case like this, the use of an odd number of rating scale points that provides for a center point is even more important. Think about how most attributes are distributed. Normally, right? The bell curve. That is, most cases score IN THE MIDDLE. Those in the middle ranges of an attribute need a place to mark their rating scale, and it needs to be in the middle. Only if the scale is 5 points, 7 points, or maybe 9 points is there a middle position to mark. With an even number of scale points, where do the cases who are in the center of the distribution mark their scale?
If you're in the planning stages of your thesis or dissertation research, I'd like to work with you to make sure that the simple decisions (like choosing the proper number of rating scale points) are made thoughtfully. It's vastly easier to work with data from a well-designed study than to cobble together results from a study that rushed forward without making reasoned choices in the design stages.
I was recently called upon to write an argument to convince a dissertation chair of something I’d always assumed everyone knew: One can’t claim that a non-manipulated Independent Variable exerts a causal impact on a Dependent Variable. Only when the researcher randomly assigns cases to the Experimental and Control groups (i.e., manipulates the Independent Variable) can one draw causal conclusions.
The causal-comparative design looks on its face very much like an experiment. To use Campbell and Stanley’s (1963) method of designation:
Treatment Group: X O
Comparison Group: O
What makes it a quasi-experimental design rather than a true experiment is the absence of random assignment of cases to groups.
Suppose one wished to evaluate the effectiveness of meditation in reducing trait anxiety. Using the causal-comparative approach, you’d identify a group of people who practiced meditation (the treatment X being evaluated) and a group of people who don’t practice meditation and then observe to see if their scores on a trait anxiety (O) measure differ significantly.
Suppose that meditators were found to be significantly less anxious than non-meditators. Would this indicate that it was their practice of meditation that caused this lowered anxiety? NO! It’s just as likely that people who are already low in anxiety are drawn to meditation and that people who are high in anxiety are too jittery to sit still long enough to meditate! In other words, it is just as likely that people’s preexisting anxiety levels caused them to choose or shun meditation as it is that meditation caused lower levels of anxiety.
If the causal-comparative design does not establish that the IV and DV are causally related, what does it establish? Only that the IV and DV are correlated. There’s nothing causal in the conclusions that we can draw from the “causal-comparative” method.
I guess my intro stat teacher taught me something, because I was surprised to learn that so many people don't know the answer to this question. This is a short post because the answer is short: it's because the statistics used to measure asymmetry of a distribution produce a value that's positive when the distribution is positively skewed, and negative when the distribution is negatively skewed. Nothing fancier than that! For instance, the simplest possible measure of skewness is:
Skewness = Mean - Median
If the distribution is symmetrical (i.e., no skew) the measure will be 0. If the distribution is positively skewed, the measure will take on a positive value. And if the distribution is negatively skewed, the measure of skewness will take on a negative value.
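A few lines of Python (with made-up data) confirm the sign behavior of this crude measure:

```python
# Crude skewness measure: mean minus median.
from statistics import mean, median

def crude_skew(data):
    # positive when a long right tail pulls the mean above the median,
    # negative when a long left tail pulls the mean below it
    return mean(data) - median(data)

print(crude_skew([1, 1, 2, 2, 10]))  # positive: a high outlier pulls the mean up
print(crude_skew([0, 8, 9, 9, 10]))  # negative: a low outlier pulls the mean down
print(crude_skew([1, 2, 3, 4, 5]))   # 0: symmetrical distribution
```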
Although the output from many SPSS statistical procedures is rich (sometimes too rich), that output is not always clearly labeled. This is particularly true when it comes to information about correlations, partial correlations, and semi-partial correlations (called "part" correlations by SPSS) that is included in the output from the Regression procedure (at least it's included if you've selected "part and partial correlations" in the Statistics dialog box).
Here's a quick guide to reading Regression output pertaining to correlations, partial correlations, and semi-partial correlations.
Let's start with a simple example of partial and semi-partial correlations. Suppose that you want to study the relationship between the number of cigarettes people smoke per year (your X variable) and the number of colds they catch per year (your Y variable). However, you think that stress may cloud this relationship because stress could affect both X (amount of smoking) and Y (how many colds people get). In a partial correlation analysis, you would treat stress as your nuisance or mediating variable (we'll call it A) and you'd control statistically for the influence of stress on both cigarette smoking and colds because stress can reasonably be expected to impact both your X and Y variables.
For an example of semi-partial correlation, suppose you are looking at the correlation between employee age (X) and job satisfaction (Y), and you see that the correlation is positive: older workers are more satisfied than younger workers. You think, however, that this relationship might be mediated by income (A): older employees might be more satisfied with their jobs than younger employees, not because of age per se, but because older employees make more money. Here, you would need to statistically control for the effects of income on just job satisfaction, but not age, because income could affect satisfaction but income could not affect age.
Now that we've seen a couple examples of when partial and semi-partial correlation might be used, let's get some symbols out of the way.
X, Y, and A are the names of our variables as follows:
X and Y = the primary variables. It is the relationship between these two variables that you're mostly interested in, like the relationship between X = temperature and Y = burglaries.
A = the nuisance or mediating variable that you want to control statistically in order to more clearly study the relationship between X and Y. For instance A = the number of homeowners who are away on vacation.
Now here are symbols for the various types of correlations:
Rxy = the Pearson correlation between X and Y. SPSS calls this the "zero-order" correlation because in this correlation there are "zero" nuisance or mediating variables being controlled.
Rxy.a = the partial correlation between X and Y, statistically controlling both X and Y for the influence of A
Rx(y.a) = the semi-partial ("part") correlation between X and Y, statistically controlling only Y for the influence of A
Ray = the Pearson correlation between A and Y
Ray.x = the partial correlation between A and Y, statistically controlling both A and Y for the influence of X
Ra(y.x) = the semi-partial ("part") correlation between A and Y, statistically controlling only Y for the influence of X
All six of these correlations are listed in the SPSS output box labeled "Coefficients," but they are not clearly labeled, so it's hard to know which numbers mean which things. When Y is regressed on X and A, the "Correlations" section of the Coefficients table contains three columns--Zero-order, Partial, and Part--with one row per predictor, organized like this:
Row for X: Zero-order = Rxy, Partial = Rxy.a, Part = Rx(y.a)
Row for A: Zero-order = Ray, Partial = Ray.x, Part = Ra(y.x)
As is typical in SPSS, they've given us far more information than we needed (or wanted) in order to answer our research questions. For the example of the partial correlation given earlier, we'd want to know the correlation between cigarette smoking and colds, Rxy, and the partial correlation between cigarette smoking and colds, controlling both statistically for the influence of stress, Rxy.a, but none of the other values provided by SPSS are relevant.
For the example of the semi-partial correlation, we'd want to know the correlation between age and job satisfaction, Rxy, and the semi-partial correlation between age and satisfaction, controlling satisfaction statistically for the influence of salary, Rx(y.a), but none of the rest of the output is relevant.
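For readers who want to check SPSS's numbers by hand, here's a Python sketch (made-up data) of the textbook formulas for the partial and semi-partial correlations, cross-checked against the residualizing approach (controlling a variable for A means correlating with the residuals left after regressing on A):

```python
# Partial and semi-partial correlations from zero-order correlations,
# cross-checked against the residualizing approach (made-up data).
import math
from statistics import mean

def pearson(u, v):
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def residualize(v, on):
    # residuals from the simple regression of v on 'on'
    mo, mv = mean(on), mean(v)
    slope = sum((a - mo) * (b - mv) for a, b in zip(on, v)) / \
            sum((a - mo) ** 2 for a in on)
    return [b - (mv + slope * (a - mo)) for a, b in zip(on, v)]

# made-up data for X, Y, and the nuisance variable A
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
a = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

rxy, rxa, rya = pearson(x, y), pearson(x, a), pearson(a, y)

# Rxy.a: partial correlation (A removed from both X and Y)
partial = (rxy - rxa * rya) / math.sqrt((1 - rxa**2) * (1 - rya**2))

# Rx(y.a): semi-partial ("part") correlation (A removed from Y only)
semipartial = (rxy - rxa * rya) / math.sqrt(1 - rya**2)

# cross-check: the same values come from correlating residualized variables
assert abs(partial - pearson(residualize(x, a), residualize(y, a))) < 1e-9
assert abs(semipartial - pearson(x, residualize(y, a))) < 1e-9
print(round(partial, 3), round(semipartial, 3))
```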
One common practice used to control for one form of response bias when collecting rating scale data is to reverse-word some items. Presenting survey respondents with a series of items that are all positively worded (e.g., "I think that statistics is fun." "I have a good time working with statistics." "I enjoy reading about statistics.") can lead some respondents to start checking all high (or all low) ratings without really thinking about what they're doing. By reverse-wording some of the items (e.g., "I do not think that statistics is fun." "I do not have a good time working with statistics." "I do not enjoy reading about statistics."), the idea is that respondents are forced to read the items more closely, which should lead them to make more informed ratings. Of course, reverse-worded items must be reverse-scored before total scores are calculated.
While this seems like a good idea in theory, in practice I've observed something else when inventories are constructed of a mixture of positively-worded and negatively-worded items. Factor analyses of these inventories frequently find two factors--one on which the positively-worded items load strongly, and one on which the negatively-worded items load strongly. That challenges the internal consistency of the inventory, which in turn means the inventory totals (or averages) are invalid. Other researchers have noted that ratings given to positively worded items show less variability than ratings given to negatively worded items--suggesting that people may have some trouble comprehending negatively worded items.
Perhaps a more desirable approach to breaking response bias is to mix together the items from two (or more) inventories measuring very different constructs. Word items in both inventories in a positive direction, but force respondents to slow down and read the items by ensuring that any two consecutive items deal with different issues.
Every so often I stumble across another SPSS quirk. Recently I was doing some customized partial and semi-partial correlations that aren't available through the SPSS menu system, but rather, required that I residualize some variables myself. Using the SPSS Regression procedure, I saved the "standardized residuals" to my data file. Just out of curiosity, and my natural tendency to check everything twice, I calculated the mean and standard deviation of these "standardized residuals." Not surprisingly, the mean was 0, as it should be. However, the standard deviation was NOT 1.0, but in the .9's. In true standard scores, the mean will always equal 0 and the standard deviation will always equal 1.0.
To get TRUE standardized residuals, one must save the unstandardized residuals, then convert those to standard-score form with the SPSS Descriptives procedure (Analyze > Descriptive Statistics > Descriptives; then check "Save standardized values as variables"). If you create your standard scores in that fashion, you'll find that the mean is 0 and the standard deviation is 1.0.
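To see why the two differ, here is a Python sketch with toy data (ordinary least squares fit by hand, purely for illustration). SPSS-style "standardized residuals" divide each residual by the standard error of estimate, whose denominator uses n - 2 rather than n - 1, so their sample standard deviation comes out just below 1; z-scoring the raw residuals yields a standard deviation of exactly 1:

```python
import math
import statistics

# Toy data (hypothetical); simple OLS regression of y on x computed by hand.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9, 8.4, 8.8, 10.1, 10.7]
n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# SPSS-style "standardized residuals": residual / standard error of estimate.
# The SEE divides SSE by n - 2, so the sample SD of these falls short of 1.
see = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
zresid_spss = [e / see for e in resid]
print(round(statistics.stdev(zresid_spss), 3))  # sqrt((n-2)/(n-1)) = 0.943 here

# TRUE standardized residuals: z-score the raw residuals themselves.
m_resid, sd_resid = statistics.mean(resid), statistics.stdev(resid)
zresid_true = [(e - m_resid) / sd_resid for e in resid]
print(round(statistics.stdev(zresid_true), 3))  # 1.0
```

With n = 10 the SPSS-style standard deviation is sqrt(8/9), about .94--exactly the "in the .9's" pattern described above.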
Some years ago, when I wrote my text, Statistics for the Social and Behavioral Sciences: Univariate, Bivariate, and Multivariate, I included a description of a method by which one can recode the values of a continuous variable so as to have any desired mean and standard deviation. This is useful in a surprisingly large number of situations. For instance, suppose you have a measure of Spiritual Well Being (SWB) that seems also to reflect a certain amount of plain old depression. To remove that "contamination" from the SWB variable, you can regress SWB on a measure of depression (like the Beck Depression Inventory) and save the residuals. Those residuals represent SWB from which the influence of depression has been removed. The problem is, the residual scores look VERY different from the original SWB scores. It doesn't really matter, because strong positive values still represent large amounts of SWB and strong negative values still represent small amounts of SWB. Even so, it can be disconcerting to work with numbers that look so unfamiliar. Solution: recode the residual scores to have whatever mean and standard deviation you want them to have--maybe something familiar like a mean of 100 and a standard deviation of 15, like IQ scores.
How do you do that? The method of modified z-scores. (1) Standardize the residualized SWB variable, i.e., convert the residuals to z-score form (see my comment immediately above for advice on standardizing residuals). (2) Create your new variable with a compute statement as follows: NewVar = (desired standard deviation * z) + desired mean. (3) Finally, always check the mean and standard deviation of your new variable to be sure it has the desired values.
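The arithmetic of steps 1 and 2 can be sketched in Python (the scores below are hypothetical; in SPSS you would do the same thing with a COMPUTE statement):

```python
import statistics

# Modified z-scores: rescale any variable to a chosen mean and SD.
# Hypothetical residual scores; any set of numbers works the same way.
scores = [-2.3, -1.1, -0.4, 0.0, 0.6, 1.2, 2.0]
m, s = statistics.mean(scores), statistics.stdev(scores)
z = [(x - m) / s for x in scores]      # step 1: convert to z-scores
new = [15 * zi + 100 for zi in z]      # step 2: desired SD = 15, mean = 100
# step 3: verify the new mean and SD
print(round(statistics.mean(new), 1), round(statistics.stdev(new), 1))  # 100.0 15.0
```

The rescaled values now read like familiar IQ-style scores while preserving the rank order and relative spacing of the residuals.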
As a user of SurveyMonkey, I have been annoyed that the platform does not provide a direct measure of the time it takes respondents to complete surveys. SurveyMonkey collects this information as metadata, and you can even filter responses that took more than or less than X minutes, but the data file downloaded from SurveyMonkey does not include response time as a variable. Survey response time is an important variable for screening data quality, particularly for identifying "speeders" who have not given the survey sufficient attention for their data to be considered valid. Here's how to create a variable that provides a measure of survey response time in minutes.
First, your data must be exported from SurveyMonkey as an Excel file. In Excel, using Format Cells, transform both of the Date/Time columns into a custom format, "dd-mmm-yyyy hh:mm:ss", so that SPSS will be able to recognize the data. Then import that Excel file into SPSS. Now go to the Variable View and find the two Date/Time variables. In the Type column, select Date, check the "dd-mmm-yyyy hh:mm:ss" radio button, and click OK. Finally, create and run the syntax file shown below to create a new variable called TimeElapsed which measures response time in minutes:
COMPUTE TimeElapsed = DATEDIF(EndDate, StartDate, "minutes").
EXECUTE.
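For readers who want to double-check the arithmetic outside SPSS, here is the same calculation in Python (the two timestamps are hypothetical; note that SPSS's DATEDIF truncates to whole minutes, while this sketch keeps any fraction):

```python
from datetime import datetime

# Parse two stamps in the same "dd-mmm-yyyy hh:mm:ss" format used above.
fmt = "%d-%b-%Y %H:%M:%S"
start = datetime.strptime("03-Mar-2024 14:05:30", fmt)
end = datetime.strptime("03-Mar-2024 14:22:30", fmt)

# Elapsed response time in minutes (use int(...) to mimic SPSS truncation).
elapsed_minutes = (end - start).total_seconds() / 60
print(elapsed_minutes)  # 17.0
```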
A variety of methods are available to help identify survey respondents who, for whatever reasons, have not provided thoughtful responses to the survey questions. This entry describes one such approach based on calculating a measure of the variability of each respondent’s ratings.
Suppose that a series of rating scales all measure the same construct and that they all measure it in the same direction (i.e., items that need to be reverse-scored have been reversed); any instrument with a Cronbach's alpha of .80 or higher would qualify. In that case, respondents who possess a certain amount of the attribute should show some variability in their ratings across those items, but not too much and not too little. Too little variability would suggest that the respondent isn't thinking carefully enough about subtle differences between the items. Too much variability would suggest that the respondent is responding almost at random.
Look at the following ratings on a 1-7 scale for five items intended to measure procrastination. Low scores indicate low levels of procrastination and high scores indicate high levels. Rows 1-3 come from respondents who were careful in giving ratings about their tendency to procrastinate; rows 4-6 are from respondents who are giving honest responses, but aren’t thinking very carefully about the subtle differences from question to question; and rows 7-9 are from respondents who are responding randomly or in some pattern, perhaps just wanting to finish the survey quickly.
5 6 5 5 7
1 2 2 1 3
3 4 3 3 5
6 6 6 5 6
1 1 1 1 2
4 4 4 4 4
1 7 6 4 7
1 3 5 7 1
1 7 1 7 1
If the variables (items) are named Item1 thru Item5, either of the following equivalent lines of syntax will calculate the standard deviation of each participant's ratings:
COMPUTE StdDev = SD(Item1, Item2, Item3, Item4, Item5).
COMPUTE StdDev = SD(Item1 TO Item5).
The following standard deviations would result for the nine respondents shown: .89, .84, .89, .45, .45, .00, 2.55, 2.61, 3.29.
You can see that standard deviations that are either too low (rows 4, 5, 6) or too high (rows 7, 8, 9) are indicative of low quality data. In any given data set, what constitutes “too low” or “too high” needs to be determined empirically by examining a frequency distribution of the standard deviation values and seeing how various standard deviations match up with different response patterns.
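Those standard deviations are easy to verify outside SPSS. Here is a Python sketch using the nine example rows above (Python's statistics.stdev uses the n - 1 denominator, the same sample formula the SPSS SD function uses):

```python
import statistics

# Per-respondent rating SD for the nine example rows above.
rows = [
    [5, 6, 5, 5, 7], [1, 2, 2, 1, 3], [3, 4, 3, 3, 5],   # careful responders
    [6, 6, 6, 5, 6], [1, 1, 1, 1, 2], [4, 4, 4, 4, 4],   # too little variability
    [1, 7, 6, 4, 7], [1, 3, 5, 7, 1], [1, 7, 1, 7, 1],   # too much variability
]
sds = [round(statistics.stdev(r), 2) for r in rows]
print(sds)  # [0.89, 0.84, 0.89, 0.45, 0.45, 0.0, 2.55, 2.61, 3.29]
```

A frequency distribution of a variable computed this way is the starting point for deciding, in your own data, where "too low" and "too high" fall.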
Copy and Paste Large SPSS Tables and Figures Into Word Documents
Sometimes SPSS tables are so wide that when you copy-and-paste them into your Word document they are too wide to fit. To copy and paste one of those extra-wide SPSS tables use this procedure:
1. Right click on the table in SPSS output that you want to copy
2. Left click on Copy Special
3. Select "Image" and de-select everything else
4. Click on OK
5. Left click in your Word document where you want the table to appear
6. Right click, then click Paste
The DISADVANTAGE: Content copied like this into Word cannot be edited as would normally be the case, so any editing of the table you want to do will have to be done in SPSS.
It is helpful to number each case (line) in your SPSS data file so that you can refer to cases by their numbers. A convenient way of generating a variable "CaseID" which consists of consecutive case numbers for a data file is to create and run a syntax file as follows:
COMPUTE CaseID = $CASENUM.
FORMAT CaseID (F8.0).
EXECUTE.
That will create a variable labeled CaseID which numbers each line of your data file from 1 thru N.
Suppose now, though, that you delete the data from case #3 for some reason--perhaps the case was identified as a "flatliner" (no variability from one rating scale to the next) or a "speeder" (completed your survey in an unreasonably short period of time) or something like that. With that case removed, your CaseID numbers will now run 1, 2, 4, 5, ..., N.
Now suppose you run an analysis in SPSS in which the output identifies cases--perhaps like the Explore procedure which can be used to identify univariate extreme scores and outliers. Further suppose the case that you've designated as #4, which now occupies the third line of the data file, is such an outlier. SPSS won't identify that case by the CaseID number that you've assigned. Rather, since the case occupies the third line in the data file, the outlier will be identified by its line number (3) not its CaseID number (4).
In Tabachnick and Fidell's multiple editions of Using Multivariate Statistics, they offer formulas for performing square-root, log10, and reciprocal data transformations, the point of which is to normalize the data distribution. This is a relatively straightforward process when working with correlational statistics, but it gets a little more confusing when one is trying to normalize the data for a factorial ANOVA. Remember that the assumption in factorial ANOVA is that the distributions should be normal within each cell of the factorial design. In the formulas that the authors provide, they mention the constants "k" = "the largest score + 1" and "c" = "a constant added to each score so that the smallest score is 1." In factorial ANOVA, the question arises: is "k" the largest score + 1 in each cell? That is, does one use a different value of k in transforming the data in each cell? Or is k the largest score across all the cells of the design + 1? Dr. Tabachnick was kind enough to clarify this for me. The constant k is found by determining the largest value of the dependent variable in any of the cells of the design, then adding 1 to that value. Similarly, the value c is chosen so that the smallest score across all the cells is 1.
One other comment about attempting to normalize data in the cells of a factorial ANOVA: Good Luck! The same transform must be applied in all cells and it's almost certainly going to make matters worse in some cells while improving the distributions in other cells. In other words, an idea that sounds good in theory isn't actually very useful most of the time. One might be better off to use nonparametric statistics when they're available, or to make sure that sample sizes in all cells are reasonably large (e.g., >30) so that the Central Limit Theorem ensures that the sampling distribution of the means is normally distributed even if the samples are not. If there are no nonparametric alternatives and your sample size is small, you can always just use a more stringent (p < .001) level of significance in evaluating the significance tests and caution your readers that the normality assumption was violated.
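To make the constants concrete, here is a Python sketch of the transformations with k and c computed across ALL cells, as Dr. Tabachnick advises (the scores below are hypothetical; in practice you would pool the dependent variable across every cell before finding the minimum and maximum):

```python
import math

# Hypothetical dependent-variable scores pooled across every cell of the design.
all_scores = [0.5, 2.0, 4.0, 9.0, 15.5]
c = 1 - min(all_scores)   # added to each score so the smallest score becomes 1
k = max(all_scores) + 1   # used to reflect negatively skewed distributions

sqrt_t  = [math.sqrt(x + c) for x in all_scores]    # square-root transformation
log_t   = [math.log10(x + c) for x in all_scores]   # log10 transformation
recip_t = [1 / (x + c) for x in all_scores]         # reciprocal transformation
reflect = [k - x for x in all_scores]               # reflection: k - score

print(min(x + c for x in all_scores))  # 1.0 -- smallest shifted score is 1
print(min(reflect))                    # 1.0 -- largest original score maps to 1
```

Because the same c and k apply to every cell, the same transform is applied everywhere--which is exactly why, as noted above, it may improve some cells while worsening others.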
When using the bar graphs that are generated by the Frequencies procedure in SPSS, if you've told SPSS to represent percentages on the vertical axis, you need to be aware that your category bars will indicate the percentage of those who responded to the item that fall into each category of the variable. In other words, these are "valid percentages," not "percentages." The bar graphs generated by the Frequencies procedure do not give you the option of plotting "percentages," i.e., the percentage of the whole sample, those who responded and those who didn't, that fall into each category. To plot percentages of the whole sample in each category, including the "missing" or "nonresponding" category, click on Graphs (in the menu bar) > Legacy Dialogs > Bar > Simple (for most situations) > Define > Options, then in the Missing Values area, check "Display groups defined by missing values." The resulting bar graph will indicate the percentage of the whole sample that falls into each category of the variable, including the missing values category.
Why is it that a dichotomously-scored (two-category) nominal scale variable can be used as a predictor in statistical procedures like multiple regression and discriminant analysis? (Yes, discriminant analysis too, notwithstanding statements from lots of authors that this method requires continuous predictor variables, i.e., interval or ratio scale variables.) It's because a dichotomously-scored nominal scale variable (often called a binary variable, because the only two scores are 0 and 1) is actually a ratio scale variable. How can it be that what appears to be a 2-category nominal variable is actually a ratio scale variable? Here's your answer. A ratio variable is one in which a score of 0 = none of the attribute and each successive score point--1, 2, 3, ...--represents one more fixed-sized unit of the attribute being measured. Now let's take the dichotomous variable, "Military Veteran," with two categories coded no = 0 and yes = 1. Someone with no military service is scored 0 because they possess none of the attribute being measured. Someone with military service is scored 1 because they possess one unit of military service. There don't happen to be any cases scored 2, 3, 4 or higher because we don't need those values. We just need 0 for none and 1 to represent those individuals with one "fixed-size unit" of military service. But the lack of need for scores above 1 doesn't somehow disqualify a binary variable from being ratio scale. Any dichotomous variable can therefore be treated as a nominal variable or as a ratio variable, whichever is more convenient at the moment.
In the presence of skewed data distributions, one or another of the commonly used data transformations (square-root, log10, or reciprocal) can help to normalize the data. Any of the texts by Tabachnick & Fidell are a good source of information about score transformations. Of course the transformed scores take on different numerical values than the original scores, sometimes even resulting in score reflection--the lowest original scores become the highest transformed scores and the highest original scores become the lowest transformed scores. Reflected scores can be (and should be) re-reflected (again, see Tabachnick and Fidell, 2013) to solve that problem, but nothing will change the fact that the transformed scores will still take on different values than the original scores. This can be vexing in some applications. For instance, where the mean of a set of original IQ scores might have a value of 100 and be perfectly interpretable ("average"), the transformed mean might be 10, not a directly meaningful value. One should not make too much of these changed values, though. Assuming that one has re-reflected reflected scores, high transformed scores still reflect greater amounts of the attribute and low transformed scores still reflect lesser amounts of the attribute. This is a point that is missed by some, who make far too much of the fact that transformed scores take on different numerical values than the original scores, to the point that they believe that the transformed scores no longer measure the same construct that was measured by the original scores. The quickest way to dispel this misimpression is to examine the correlation between original and transformed scores. That correlation typically runs in the upper 0.90's. It is axiomatic in statistics that, to the extent that two variables are correlated, they measure the same thing.
Data transformations do not change the construct being measured; just the score values and the shape of the distribution of the scores.
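The point about the correlation between original and transformed scores is easy to demonstrate. Here is a Python illustration using simulated positively skewed data (the distribution and sample size are arbitrary choices for the demonstration, not data from any real study):

```python
import math
import random
import statistics

# Simulate positively skewed "original" scores, then log10-transform them.
random.seed(1)
original = [math.exp(random.gauss(3, 0.4)) for _ in range(500)]
transformed = [math.log10(x) for x in original]

def pearson_r(a, b):
    """Pearson correlation computed from first principles."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

print(round(pearson_r(original, transformed), 2))  # typically in the upper .90s
```

The transformed scores are a monotonic rescaling of the originals, so the rank order of cases is preserved exactly and the linear correlation remains very high.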
G*Power does not provide any direct application for estimating sample size requirements or statistical power for partial or semipartial ("part") correlation analysis. Various workarounds are suggested, ranging from the impossibly complex to simply ignoring the presence of the covariates and using the same G*Power app that is used for Pearson correlation. The procedure I favor is to use the G*Power app for estimating sample size (in an a priori power analysis) or power (in a post hoc power analysis) for the tests of significance of the individual predictors in multiple regression analysis. This approach is based on the fact that the tests of significance of the individual predictors in SPSS multiple regression are also the tests of the significance of the related partial and semipartial correlations.
Here is the G*Power procedure:
Tests > Correlation and regression > Linear multiple regression: Fixed model, single regression coefficient
Type of power analysis: a priori
Tail(s): choose one-tail if you've predicted the sign of the partial or semipartial correlation, or choose two-tail if you're equally interested in a correlation in either direction
Effect size f-squared: Use .02 for a weak population effect, .15 for a medium strength effect, or .35 for a strong population effect
Alpha err prob: your chosen level of significance, typically .05
Power (1 - beta err prob): your chosen level of statistical power, typically .80 which gives you a Type II error probability of .20
Number of predictors: set equal to the total number of variables in the analysis (X, Y, and all covariates) minus one.
Then click "Calculate"
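If you prefer to think in terms of an expected partial correlation rather than the f-squared benchmarks, the conversion is direct: f-squared = pr-squared / (1 - pr-squared), where pr is the partial correlation. A small Python sketch (the .30 value below is a hypothetical expected partial correlation, not a recommendation):

```python
# Convert an expected PARTIAL correlation into the f-squared effect size
# that G*Power asks for: f2 = pr**2 / (1 - pr**2).
def f_squared_from_partial_r(pr):
    return pr ** 2 / (1 - pr ** 2)

print(round(f_squared_from_partial_r(0.30), 4))  # 0.0989
```

The resulting value can then be typed directly into the "Effect size f-squared" field in place of the weak/medium/strong benchmarks.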