Lesson 4a: Simple Statistics

The main purpose of SAS is to perform statistical tests on data. In lesson 4, we will be talking about some of the simple tests that you can perform. These include finding the means, standard deviations, and ranges of your data, finding the cross correlation of the data, and performing some simple t-tests.

The first procedure we'll look at is proc freq, which counts how often each value of a variable occurs.


Example 4.1

You teach a small class of 15 students. You have data on their gender, their age, and their final grade as a percentage. You would like to make frequency tables of the class, broken down by age, by gender, and by grade. In addition, you would like to make a cross-tabulation of gender and age.

The code is as follows. It can also be found under lesson4-1.sas.

/* Example 4-1 */
options linesize=80 pagesize=54 pageno=1;
data students;
        input gender $ age grade;
        cards;
        m 22 86
        f 21 81
        f 35 92
        f 20 55
        m 22 41
        m 22 71
        f 20 79
        f 19 66
        m 20 98
        f 21 89
        f 19 71
        m 20 31
        m 21 82
        f 20 71
        f 18 91
        ;
run;

proc freq data=students;
        tables age gender grade;
run;

proc freq data=students;
        tables gender*age;
run;

In this case, we define each student by their gender, their age, and their final score in the class. Then we use proc freq twice. After we declare proc freq and the data set, we have to say which tables to make. In the first case, we want three tables, one each for age, gender, and grade. The second time we run proc freq, we want a table of gender by age. To do this, we insert the '*' character between the variables.
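As a minimal sketch of two other ways to write the same requests (these steps are not part of lesson4-1.sas): a single proc freq step may contain more than one tables statement, and parentheses can group variables, so that gender*(age grade) asks for gender by age and gender by grade.

```sas
/* Sketch: one proc freq step that produces every table from Example 4-1 */
proc freq data=students;
        tables age gender grade;   /* three one-way tables */
        tables gender*age;         /* one two-way table: gender in rows, age in columns */
run;

/* Sketch: the parentheses expand to gender*age and gender*grade */
proc freq data=students;
        tables gender*(age grade);
run;
```

The variable before the '*' labels the rows and the variable after it labels the columns, which is why the output below is titled TABLE OF GENDER BY AGE.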

To understand what proc freq does, let's look at the sample output.


Output from proc freq

                                 The SAS System                                1
                                           Cumulative  Cumulative
               AGE   Frequency   Percent   Frequency    Percent
               -------------------------------------------------
                18          1       6.7           1        6.7
                19          2      13.3           3       20.0
                20          5      33.3           8       53.3
                21          3      20.0          11       73.3
                22          3      20.0          14       93.3
                35          1       6.7          15      100.0



                                            Cumulative  Cumulative
              GENDER   Frequency   Percent   Frequency    Percent
              ----------------------------------------------------
              f               9      60.0           9       60.0
              m               6      40.0          15      100.0



                                           Cumulative  Cumulative
              GRADE   Frequency   Percent   Frequency    Percent
              ---------------------------------------------------
                 31          1       6.7           1        6.7
                 41          1       6.7           2       13.3
                 55          1       6.7           3       20.0
                 66          1       6.7           4       26.7
                 71          3      20.0           7       46.7
                 79          1       6.7           8       53.3
                 81          1       6.7           9       60.0
                 82          1       6.7          10       66.7
                 86          1       6.7          11       73.3
                 89          1       6.7          12       80.0
                 91          1       6.7          13       86.7
                 92          1       6.7          14       93.3
                 98          1       6.7          15      100.0



                                 The SAS System                                2
                              TABLE OF GENDER BY AGE
    GENDER     AGE
    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |      18|      19|      20|      21|      22|      35|  Total
    ---------+--------+--------+--------+--------+--------+--------+
    f        |      1 |      2 |      3 |      2 |      0 |      1 |      9
             |   6.67 |  13.33 |  20.00 |  13.33 |   0.00 |   6.67 |  60.00
             |  11.11 |  22.22 |  33.33 |  22.22 |   0.00 |  11.11 |
             | 100.00 | 100.00 |  60.00 |  66.67 |   0.00 | 100.00 |
    ---------+--------+--------+--------+--------+--------+--------+
    m        |      0 |      0 |      2 |      1 |      3 |      0 |      6
             |   0.00 |   0.00 |  13.33 |   6.67 |  20.00 |   0.00 |  40.00
             |   0.00 |   0.00 |  33.33 |  16.67 |  50.00 |   0.00 |
             |   0.00 |   0.00 |  40.00 |  33.33 | 100.00 |   0.00 |
    ---------+--------+--------+--------+--------+--------+--------+
    Total           1        2        5        3        3        1       15
                 6.67    13.33    33.33    20.00    20.00     6.67   100.00


SAS handles a single-variable table differently from a two-variable table. In the single-variable case, SAS outputs the count for each category, its percentage of the total count, a cumulative frequency, and a cumulative percentage.
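If the two cumulative columns are not wanted, they can be switched off. The sketch below assumes the students data set from Example 4.1 and uses the nocum option, which is written after a slash on the tables statement.

```sas
/* Sketch: one-way table of age showing only Frequency and Percent */
proc freq data=students;
        tables age / nocum;   /* nocum drops the cumulative frequency and percent */
run;
```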

In the two-variable case, the output is more complicated. For each cell, SAS outputs the count for that combination of values, along with its percentage of the grand total, its percentage of the row total, and its percentage of the column total. For example, looking at row 1, column 3, which describes females aged 20, SAS gives us these statistics:

      3 |  <- Frequency, or number of females aged 20
  20.00 |  <- Percentage of all students who are both female and 20
  33.33 |  <- Row percentage, or percent of females who are aged 20
  60.00 |  <- Column percentage, or percent of 20 year olds who are female
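To see where these three percentages come from, the arithmetic can be reproduced by hand. The data step below is only an illustrative sketch: the variable names are made up for this example, and the counts are copied from the table above (3 females aged 20, 15 students, 9 females, 5 students aged 20).

```sas
/* Sketch: recompute the statistics for the cell "female, aged 20" */
data _null_;
        count    = 3;    /* females aged 20 */
        n_total  = 15;   /* all students */
        n_female = 9;    /* row total for f */
        n_age20  = 5;    /* column total for age 20 */

        percent = 100 * count / n_total;    /* share of all students */
        row_pct = 100 * count / n_female;   /* share of females */
        col_pct = 100 * count / n_age20;    /* share of 20 year olds */

        /* write the three values to the SAS log with two decimals */
        put percent= 6.2 row_pct= 6.2 col_pct= 6.2;
run;
```

The values written to the log should agree with the 20.00, 33.33, and 60.00 that proc freq prints in the cell.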

SAS also gives the row and column totals at the edges of the chart. These are the same values we would get from a single-variable call to proc freq. For example, the bottom row of the chart looks like:

    Total           1        2        5        3        3        1       15
                 6.67    13.33    33.33    20.00    20.00     6.67   100.00

Remember, each column is a different age. If you compare these values to our frequency table of ages, they are the same, apart from the number of decimal places shown.

This is the default output of proc freq. However, it also has many optional statistics that you can include. For example, if you want to include expected values and deviations from the expected values, you can change the code slightly to read:

proc freq data=students;
        tables gender*age / expected deviation;
run;

The options are included after the slash ("/" character). For further options, please see the procedure index.
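Options after the slash can also control where the results go. As a hedged sketch, the out= option saves the cell counts and percentages of a table to a data set, which can then be printed or used in later steps. The data set name counts is chosen only for this illustration.

```sas
/* Sketch: save the gender-by-age counts to a data set named counts */
proc freq data=students;
        tables gender*age / out=counts;
run;

/* The saved data set holds gender, age, and the variables COUNT and PERCENT */
proc print data=counts;
run;
```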


Lesson 4b continues with calculating means and variances.