UNIVERSITY AT BUFFALO, THE STATE UNIVERSITY OF NEW YORK
The Department of Computer Science & Engineering

CSE 305
Programming Languages
Lecture Notes
Stuart C. Shapiro

Data Types

The standard definition of data type is (see Sebesta, p. 248):

A collection of data values;
and a set of operations on those values.

Values vs. Objects

The distinction is not clear.
"Value" is used most often for "the value of a variable".
An "object" is "beefier", and generally has attributes and/or methods.

"The word object is often associated with the value of a variable and the space it occupies. In this book, however, we reserve object exclusively for instances of user-defined abstract data types, rather than also using it for the values of variables of predefined types. In object-oriented languages, every instance of every class, whether predefined or user-defined, is called an object." [Sebesta, p. 249 (italics in the original)]

"Ruby is a completely object-oriented language. Every value is an object, even simple numeric literals" [Flanagan & Matsumoto, The Ruby Programming Language, 2008, p. 2 (italics in the original)].

This is immediately followed by an example of numeric literals having methods:

irb(main):001:0> 1.class
=> Fixnum

The major steps in the evolution of data types were:

A few basic built-in types, such as integers, reals, and homogeneous arrays.
Fixed size, heterogeneous aggregates (records, structures).
User-defined data types.
Abstract Data Types (ADTs).
Objects (in the OO sense).

The rest of this chapter is a survey of data types and their design issues.

Primitive Data Types

are data types not defined in terms of other data types.

Numbers

Integers

Often there is an unsigned type for binary data, and several types of signed integers, differing by length (number of bytes used).

Various coding schemes are possible. Most languages now use binary numbers for positive integers, and two's complement for negative integers.
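
As a rough illustration of two's complement, here is a small Python 3 sketch that encodes a signed integer in a fixed number of bits and decodes it again. The function names and the 8-bit default width are assumptions made for this example only.

```python
def to_twos_complement(value, bits=8):
    """Return the bit string that encodes value in two's complement."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {bits} bits")
    # Masking maps a negative value v to 2**bits + v.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(bit_string):
    """Decode a two's complement bit string back to a signed integer."""
    unsigned = int(bit_string, 2)
    # A leading 1 marks a negative number.
    if bit_string[0] == "1":
        return unsigned - (1 << len(bit_string))
    return unsigned


print(to_twos_complement(5))             # 00000101
print(to_twos_complement(-5))            # 11111011
print(from_twos_complement("11111011"))  # -5
```

Note that the positive values keep their ordinary binary form, and only the negative values use the complemented encoding.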

Bignums are integers with unlimited length. For example, in Ruby,

irb(main):001:0> def fact(n); if n<=1 then 1 else n*fact(n-1) end; end
nil

irb(main):002:0> fact(4)
24

irb(main):003:0> fact(100)
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

irb(main):004:0> fact(100).class
Bignum

Common Lisp also has bignums.

Fixed-Point

Fixed number of digits with a fixed decimal point position. Used for business applications, including currency.

Represented by binary coded decimal (BCD). Each digit represented by its binary equivalent. For example, 35 in BCD is 0011 0101.
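
The digit-by-digit encoding can be sketched in Python 3 as follows. The names to_bcd and from_bcd are hypothetical, and the sketch handles only non-negative integers; real fixed-point types also record a sign and the position of the decimal point.

```python
def to_bcd(number):
    """Encode a non-negative integer as BCD, one 4-bit group per decimal digit."""
    if number < 0:
        raise ValueError("this sketch handles only non-negative integers")
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bcd):
    """Decode a space-separated BCD string back to an integer."""
    return int("".join(str(int(group, 2)) for group in bcd.split()))


print(to_bcd(35))             # 0011 0101
print(from_bcd("0011 0101"))  # 35
```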

Floating-Point

Called "floating-point" because the decimal point floats so that the number is represented as
[+|-] (1|2|3|4|5|6|7|8|9) . {<digit>} E [+|-] {<digit>}

Usually represented using IEEE Floating-Point Standard: sign bit, exponent, fraction ("mantissa"). For more details of number representations, see my CSE115 notes on Java arithmetic.

Usually several types, differing in precision (number of bits used for the fractional part).
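
To see the three fields of the IEEE representation, one can take a double-precision number apart. This Python 3 sketch assumes the 64-bit IEEE format (1 sign bit, 11 exponent bits, 52 fraction bits); decompose is a name invented for the example.

```python
import struct


def decompose(x):
    """Split a 64-bit IEEE double into (sign, biased exponent, fraction)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits
    return sign, exponent, fraction


print(decompose(1.0))    # (0, 1023, 0)
print(decompose(-2.5))   # (1, 1024, 1125899906842624)
```

Here -2.5 is -1.25 times 2 to the power 1, so the biased exponent is 1024 and the fraction field holds 0.25, that is, 2 to the power 50 out of 2 to the power 52.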

Ratios

Common Lisp is among several languages that have a ratio numeric type:

cl-user(12): (/ 36 10)
18/5

cl-user(13): (type-of (/ 36 10))
ratio

cl-user(14): (+ 18/5 2)
28/5

Complex Numbers

Several languages including Common Lisp and Ruby have complex numbers. In Common Lisp, for example,

cl-user(15): (sqrt 4)
2.0

cl-user(16): (sqrt -1)
#C(0.0 1.0)

cl-user(17): (type-of (sqrt -1))
complex

Operations on numbers will be discussed in Chapter 7.

Booleans

The data type for conditional expressions. There are only two values, true and false.

Only some programming languages (Java and perhaps Haskell) have an actual Boolean type with two special values, True and False. Some languages have Boolean values, but allow other values to stand in for them. C uses the int 0 for False, and any other int for True. Lisp uses nil for False and any other value for True.
One test for a Boolean type:

define lessThan(x,y)
  return x<y;

but what are the possible values of a logical expression?
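
Running that test in Python 3, for example, might look like the sketch below. The function name less_than is just a rendering of the pseudocode above. Python turns out to be one of the languages that has Boolean values but lets other values count as true or false.

```python
def less_than(x, y):
    return x < y


result = less_than(3, 4)
print(result, type(result))       # True <class 'bool'>

# In Python, bool is a subclass of int, so True behaves like 1.
print(isinstance(result, int))    # True
print(True + True)                # 2

# Other values are accepted where a condition is expected.
for value in (0, 1, "", "a", [], [0], None):
    print(repr(value), "counts as", bool(value))
```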

Characters

Many languages have a data type for single characters.

Often represented in ASCII, which uses 7 bits (usually stored in an 8-bit byte), and so can code 128 different characters.

There is a move, started by Java, to use Unicode, which was originally a 16-bit code (Java's char type is 16 bits), and can represent characters from most of the languages in the world.

Strings

The use of strings as a data type grew from the need of strings for output. However, it's one thing to be able to write strings; it's another to be able to store them in variables and operate on them.

Many languages have a data type named something like string, others use arrays of characters. However, strings are usually implemented as arrays of characters.

The length of a string may be stored with the value or the variable, or may be indicated by a sentinel. For example, C and C++ terminate strings with the null character, '\0'.

String concatenation is such a common operation that several languages include an operator for it, such as Java's overloaded +. Java uses concatenation to construct output lines. Other languages use format strings with interpolated control characters.

Some other common operations are: string length; substring extraction; character at position; string comparison; and substring search.
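
For instance, here is how those operations look in Python 3. This is only a sketch of one language's spelling of them; other languages use different names and syntax.

```python
s = "This is a string."

print(len(s))               # string length: 17
print(s[10:16])             # substring extraction: string
print(s[5])                 # character at position 5: i
print("apple" < "banana")   # string comparison (lexicographic): True
print(s.find("is"))         # substring search, index of first match: 2
print(s + " Really.")       # concatenation builds a new string
```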

A major issue is whether string operations are destructive (change the argument string) or non-destructive (return a string like the argument string, except...). In Java, Strings are immutable (have no destructive operations), whereas StringBuffers are like Strings, but are mutable:

bsh % str1 = "This is a string.";

bsh % str2 = str1.replace('i', 'y');

bsh % print(str2);
Thys ys a stryng.

bsh % print(str1);
This is a string.

bsh % str3 = new StringBuffer("This is a string.");

bsh % print(str3);
This is a string.

bsh % str4 = str3.replace(8,9,"another");

bsh % print(str4);
This is another string.

bsh % print(str3);
This is another string.

Common Lisp has only mutable strings, but both destructive and non-destructive operations:

cl-user(1): (setf str1 "This is a string.")
"This is a string."

cl-user(2): (setf str2 (substitute #\y #\i str1))
"Thys ys a stryng."

cl-user(3): str2
"Thys ys a stryng."

cl-user(4): str1
"This is a string."

cl-user(5): (setf str2 (nsubstitute #\y #\i str1))
"Thys ys a stryng."

cl-user(6): str2
"Thys ys a stryng."

cl-user(7): str1
"Thys ys a stryng."

A string's length may be static, as is Java's String, dynamic, as is Java's StringBuffer, or limited dynamic, as Sebesta says C's are [p. 257]. However, the program

#include <stdio.h>
#include <string.h>

#define true 1

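int main() {
  char str[10];   /* room for 9 characters plus '\0' */
  int i = 0;
  while (true) {
    /* Deliberately no bounds check: i soon runs past the end of str. */
    str[i++] = 'a';
    str[i] = '\0';
    printf("str = %s; Its length is %d; i = %d\n", str, (int)strlen(str), i);
  }
  return 0;
}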

----------------------------------------------
<pollux:Test:1:27> gcc -Wall -o dstrlen dstrlen.c
<pollux:Test:1:28> ./dstrlen
str = a; Its length is 1; i = 1
str = aa; Its length is 2; i = 2
str = aaa; Its length is 3; i = 3
str = aaaa; Its length is 4; i = 4
str = aaaaa; Its length is 5; i = 5
str = aaaaaa; Its length is 6; i = 6
str = aaaaaaa; Its length is 7; i = 7
str = aaaaaaaa; Its length is 8; i = 8
str = aaaaaaaaa; Its length is 9; i = 9
str = aaaaaaaaaa; Its length is 10; i = 10
str = aaaaaaaaaaa; Its length is 11; i = 11
str = aaaaaaaaaaaa; Its length is 12; i = 12

was an infinite loop when I ran it on pollux. When I killed it, str had a length of 2,089. Of course, this is again C not doing range checking on arrays: writing past the end of str is undefined behavior, so what happens depends on the compiler and the operating system. On timberlake, the string never exceeds a length of 12, and i is incremented modulo 12.

Pattern matching is a common operation on strings that is a very involved subject. A large part of Perl is devoted to pattern matching. Java has an extensive pattern matching capability in the package java.util.regex. C++ also has a pattern matching library. (X)Emacs supports regular expression pattern matching for searching and replacing strings. For example, the regular expression <[^>]*> will match HTML tags.
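
The same regular expression can be tried in Python 3 with the standard re module. The sample HTML string is made up for the example.

```python
import re

# "<", then any number of characters other than ">", then ">"
TAG = re.compile(r"<[^>]*>")

html = "<p>Hello, <b>world</b>!</p>"

print(TAG.findall(html))   # ['<p>', '<b>', '</b>', '</p>']
print(TAG.sub("", html))   # Hello, world!
```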

User-Defined Types

A user-defined type is a data type with a user-declared name. For example, in C:

#include <stdio.h>

#define MperK 0.62137
#define KperM 1.60935

typedef float kilometer;
typedef float mile;

kilometer MtoK(mile x) {
  return x * KperM;
}

mile KtoM(kilometer x) {
  return x * MperK;
}

int main() {
  mile m = 100;
  kilometer k = 100;
  printf("%3.0f miles = %5.2f kilometers.\n", m, MtoK(m));
  printf("%3.0f kph = %5.2f mph.\n", k, KtoM(k));
  return 0;
}

----------------------------------------------------------------
<timberlake:Test:1:86> gcc -Wall -o conversion conversion.c
<timberlake:Test:1:87> ./conversion
100 miles = 160.93 kilometers.
100 kph = 62.14 mph.

In C, the typedef identifier is a synonym for its parent type. However, that is not true in all languages with user-defined types. If the new type identifier is not a synonym, the question is whether name type compatibility or structure type compatibility is used.

In name type compatibility, whether two expressions have compatible types depends on their type identifiers, even if the parent types are the same. In structure type compatibility, it depends on the parent types. For example, in the Ada-like type declarations

type array1type is array(1..10) of Integer;
type array2type is array(11..20) of Integer;

A: array1type;
B: array2type;

A and B do not have compatible types under name type compatibility, but do under structure type compatibility.

Some languages use name type compatibility, some use structure type compatibility, and some have facilities for both.
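
The two rules can be sketched as two different equality tests on type descriptions. In this Python 3 sketch, ArrayType is a hypothetical description of a named array type, and "same structure" is assumed to mean the same length and element type, ignoring the actual index bounds; a real language definition has to spell out such details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """A hypothetical description of a named array type."""
    name: str      # the type identifier
    length: int    # number of elements
    element: str   # element type


def name_compatible(t1, t2):
    """Name type compatibility: the type identifiers must match."""
    return t1.name == t2.name


def structure_compatible(t1, t2):
    """Structure type compatibility: only the parent structures must match."""
    return (t1.length, t1.element) == (t2.length, t2.element)


array1type = ArrayType("array1type", 10, "Integer")   # array(1..10) of Integer
array2type = ArrayType("array2type", 10, "Integer")   # array(11..20) of Integer

print(name_compatible(array1type, array2type))        # False
print(structure_compatible(array1type, array2type))   # True
```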

If a variable is declared with a type expression, such as

A: array(1..10) of Integer;

the variable is considered to have an anonymous type.

Ordinal Types

An ordinal type is one whose values can be mapped to the natural numbers, such as char. The integer types are also considered ordinal types, although the signed integers also have negatives. The important thing is that, except for the minimal value, every value of an ordinal type is the successor of a value of its type, and, except for the maximal value, every value of an ordinal type is the predecessor of a value of its type. So one should be able to use any ordinal type as an array subscript, or as a for loop index.

Enumeration Types

An enumeration type is an ordinal type whose values are identifiers chosen by the programmer. For example, in C

#include <stdio.h>

enum months {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};

int monLength[12] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

char* monName[12] = {"January", "February", "March", "April",
		      "May", "June", "July", "August",
		      "September", "October", "November", "December"};

int main() {
  enum months m;
  for (m = Jan; m <= Dec; m++) {
    printf("%s has %d days.\n", monName[m], monLength[m]);
  }
  return 0;
}

---------------------------------------------------------------
<timberlake:Test:1:89> gcc -Wall -o enumtest enumtest.c

<timberlake:Test:1:90> ./enumtest
January has 31 days.
February has 28 days.
March has 31 days.
April has 30 days.
May has 31 days.
June has 30 days.
July has 31 days.
August has 31 days.
September has 30 days.
October has 31 days.
November has 30 days.
December has 31 days.

As is usual for C, the enumeration type is treated just like int and its values are treated like int values.

In fact, let's try to assign a days value to a months variable in C:

#include <stdio.h>

enum months {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};

enum days {Sun, Mon, Tue, Wed, Thur, Fri, Sat};

int monLength[12] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

char* monName[12] = {"January", "February", "March", "April",
		      "May", "June", "July", "August",
		      "September", "October", "November", "December"};

int main() {
  enum months m;
  enum days d = Thur;
  m = d;
  printf("It ran.\n");
  return 0;
}
--------------------------------------------------------------------------
<timberlake:Test:1:123> gcc -Wall -o enumtest2 enumtest2.c
<timberlake:Test:1:124> ./enumtest2
It ran.

C++, though, is more careful:

#include <iostream>
#include <string>

using namespace std;

enum months {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};

enum days {Sun, Mon, Tue, Wed, Thur, Fri, Sat};

int main() {
  enum months m;
  enum days d = Thur;
  m = d;
  cout << "It ran." << endl;
  return 0;
}

----------------------------------------------------------------
<timberlake:Test:1:93> g++ -Wall -o enumtest enumtest.cpp
enumtest.cpp: In function 'int main()':
enumtest.cpp:19: error: cannot convert 'days' to 'months' in assignment

Unlike previous versions of Java, Java versions 1.5 and later support enumeration types, called Enums. (See also the API for the Enum class.)
Here is an example:

public class Months {
    public enum Month {January, February, March, April, May, June,
	    July, August, September, October, November, December}

    public static int[] monLength = {31, 28, 31, 30, 31, 30,
				     31, 31, 30, 31, 30, 31};

    public static void main(String[] args) {
	Month f = Month.February;
	System.out.println(f + " is the shortest month. A full list of months and lengths is:");
	for (Month m : Month.values()) {
	    System.out.println(m + " has " + monLength[m.ordinal()] + " days.");
	}
    } // end of main()
} // Months
-------------------------------------------------
<timberlake:Test:1:96> javac Months.java

<timberlake:Test:1:97> java Months
February is the shortest month. A full list of months and lengths is:
January has 31 days.
February has 28 days.
March has 31 days.
April has 30 days.
May has 31 days.
June has 30 days.
July has 31 days.
August has 31 days.
September has 30 days.
October has 31 days.
November has 30 days.
December has 31 days.

Subrange Types

A subrange type is a consecutive set of values of some ordinal type. They can be used for subranges of enumeration types. Here's an Ada example similar to Sebesta's:

type Days is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);

subtype WeekDays is Days range Mon..Fri;
subtype WeekendDays is Days range Sat..Sun;

Day1: Days;
Day2: WeekDays;
Day3: WeekendDays;

Day1 := Day2 and Day1 := Day3 are legal.
Day2 := Day3 and Day3 := Day2 are illegal.
Day2 := Day1 or Day3 := Day1 are only legal if Day1 has a proper value at run-time.
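
The run-time check in the last case can be imitated in a language without subrange types. This Python 3 sketch is only an approximation: Subrange and its check method are names invented here, and Python performs the test only when check is called, not on every assignment as Ada does.

```python
from enum import IntEnum

Days = IntEnum("Days", "Mon Tue Wed Thu Fri Sat Sun")


class Subrange:
    """A hypothetical consecutive range of values of an ordinal type."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def check(self, value):
        """Return value if it lies in the subrange, else raise ValueError."""
        if not self.low <= value <= self.high:
            raise ValueError(f"{value.name} is not in "
                             f"{self.low.name}..{self.high.name}")
        return value


WeekDays = Subrange(Days.Mon, Days.Fri)
WeekendDays = Subrange(Days.Sat, Days.Sun)

day1 = Days.Wed
day2 = WeekDays.check(day1)          # like Day2 := Day1, succeeds
print(day2.name)                     # Wed

try:
    day3 = WeekendDays.check(day1)   # like Day3 := Day1, fails at run-time
except ValueError as error:
    print(error)                     # Wed is not in Sat..Sun
```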

Subrange types are particularly useful for the indexes of arrays, such as

subtype arrayIndex is Integer range 1..100;
squares: array(arrayIndex) of Integer;

and for the indexes of for loops, such as

for i in arrayIndex loop
  squares(i) := i*i;
end loop;

Symbols

Several languages have a data type whose values are spelled like identifiers, may be lexicographically ordered, but do not have the successor/predecessor operations. Think of the values of an enumerated type without those operations. In Common Lisp and Ruby, these are called "symbols"; in Erlang and Prolog, "atoms".

In this Prolog program, the names of the months are atoms. If one were capitalized, it would be a variable.

monthLength(january, 31).
monthLength(february, 28).
monthLength(march, 31).
monthLength(april, 30).
monthLength(may, 31).
monthLength(june, 30).
monthLength(july, 31).
monthLength(august, 31).
monthLength(september, 30).
monthLength(october, 31).
monthLength(november, 30).
monthLength(december, 31).

:- monthLength(M, L), format("~a has ~d days.\n", [M,L]), fail;
	halt.
----------------------------------------------------------
<timberlake:Test:1:98> prolog -l months.pro
% compiling /projects/shapiro/CSE305/Test/months.pro...
january has 31 days.
february has 28 days.
march has 31 days.
april has 30 days.
may has 31 days.
june has 30 days.
july has 31 days.
august has 31 days.
september has 30 days.
october has 31 days.
november has 30 days.
december has 31 days.
% compiled /projects/shapiro/CSE305/Test/months.pro in module user, 0 msec 2752 bytes

The Common Lisp symbol type is a data type whose values are all the Common Lisp identifiers.

cl-user(1): (type-of '3)
fixnum

cl-user(2): (type-of '3.7)
single-float

cl-user(3): (type-of 'January)
symbol

cl-user(4): (setf monLength (make-hash-table))
#<eql hash-table with 0 entries @ #x4ec5882>

cl-user(5): (mapc #'(lambda (key value) (setf (gethash key monLength) value))
                  '(January February March April May June July August September
	            October November December )
                  '(31 28 31 30 31 30 31 31 30 31 30 31))
(January February March April May June July August September October ...)

cl-user(6): (loop for m = (progn
 		               (format t "Enter a month or `bye': ")
		               (read))
                  if (eq m 'bye) return 'Goodbye
                  do (format t "~A has ~D days.~%" m (gethash m monLength)))

Enter a month or `bye': March
March has 31 days.
Enter a month or `bye': June
June has 30 days.
Enter a month or `bye': bye
Goodbye

A symbol is like an OO object; among other instance variable-like components are its name, value, and function:

cl-user(27): (setf Fibonacci 11235)
11235

cl-user(28): (defun Fibonacci (n)
	       (if (< n 3)
		   1
		 (+ (Fibonacci (- n 1))
		    (Fibonacci (- n 2)))))
Fibonacci

cl-user(29): (symbol-name 'Fibonacci)
"Fibonacci"

cl-user(30): (symbol-value 'Fibonacci)
11235

cl-user(31): Fibonacci
11235

cl-user(32): (symbol-function 'Fibonacci)
#<Interpreted Function Fibonacci>

cl-user(33): (type-of (symbol-function 'Fibonacci))
function

cl-user(34): (Fibonacci 10)
55

Array Types

An array is an aggregate of data values, called elements of the array, with the following properties:

It is a homogeneous aggregate. That is, all the data values are of the same type.
The data values can be randomly accessed. That is, access to any element is just as fast as to any other.
The elements are accessed via a sequence of one or more indexes (subscripts), which are values of some ordinal type.
The values of the subscripts may be computed at run-time.

The ability to compute subscripts makes a subscripted array like a variable name that can be computed. More precisely, a subscripted array is an expression evaluated for its l-value. Compare these two C subroutines for the Fibonacci sequence:

#include <stdio.h>

int fibonacci(int n) {
  if (n<=2) return 1;
  int current = 1, 
    oneBack = 1,
    twoBack = 1,
    i;
  for (i=3; i<=n; i++) {
    twoBack = oneBack;
    oneBack = current;
    current = oneBack + twoBack;
  }
  return current;
}

int Fibonacci (int n) {
  if (n<=2) return 1;
  int num[3] = {1,1},   /* circular buffer holding the last three values */
    current = 1,        /* subscript of the most recent value */
    i;
  for (i=3; i<=n; i++) {
    current = (current + 1) % 3;
    num[current] = num[(current + 1) % 3] + num[(current + 2) % 3];
  }
  return num[current];
}

int main() {
  int i;
  for (i=1; i<=8; i++)
    printf("fibonacci(%d) = %d\n", i, fibonacci(i));
  printf("\n");
  for (i=1; i<=8; i++)
    printf("Fibonacci(%d) = %d\n", i, Fibonacci(i));
  return 0;
}
------------------------------------------------
<timberlake:Test:1:29> gcc -Wall -o indexdemo indexdemo.c 
<timberlake:Test:1:30> ./indexdemo
fibonacci(1) = 1
fibonacci(2) = 1
fibonacci(3) = 2
fibonacci(4) = 3
fibonacci(5) = 5
fibonacci(6) = 8
fibonacci(7) = 13
fibonacci(8) = 21

Fibonacci(1) = 1
Fibonacci(2) = 1
Fibonacci(3) = 2
Fibonacci(4) = 3
Fibonacci(5) = 5
Fibonacci(6) = 8
Fibonacci(7) = 13
Fibonacci(8) = 21

An array can be thought of as a mapping, or even a function. For example, the C array monLength, above, is a mapping from a month's ordinal, 0..11, to its length. This is clearer in the Java expression, above, monLength[m.ordinal()]. The Common Lisp use of monLength is more directly represented as a mapping. An array might also be thought of as a function from a month's ordinal to its length.

Most current programming languages use parentheses around the arguments of a function, e.g. f(x), and brackets around the subscripts of an array, e.g. a[i], but Fortran and Ada use parentheses for arrays also. Thinking of an array as a function justifies this, but most programmers find it confusing.

Common Lisp, as usual, uses a more functional notation:

cl-user(33): (setf a (make-array 10))
#(nil nil nil nil nil nil nil nil nil nil)

cl-user(34): (setf days #(Sun Mon Tue Wed Thu Fri Sat))
#(Sun Mon Tue Wed Thu Fri Sat)

cl-user(35): (aref days 3)
Wed

cl-user(36): (setf (aref a 2) 5)
5

cl-user(37): a
#(nil nil 5 nil nil nil nil nil nil nil)

Some programming languages, including Java and Common Lisp, do range-checking. That is, they give a run-time error if the program tries to use an out-of-range subscript. Others, including C, Perl, and Fortran, do not. A programming language that does range checking is clearly more reliable.

Some programming languages have a fixed lowest subscript: in C-based languages, it is 0; in Fortran, it is 1. Others allow the programmer to choose the lowest subscript.

The array subscript range might be statically bound (during compile-time); dynamically bound (during run-time), but then fixed; or fully dynamic (might change during run-time).

Array storage binding might be static, stack-dynamic, or heap-dynamic.

Some languages provide a convenient way to initialize arrays, such as the C-based languages,

int[] squares = {0, 1, 4, 9, 16, 25};

However, one must distinguish whether the {...} notation is a general array-valued constructor, allowed on the rhs of assignment statements, or only a special syntax for declaration statements.

Some languages provide array operations, i.e., operations on arrays themselves. For example, in Fortran:

      Program arrayop

      Integer A1(5), A2(5), A3(5), A4(5)
      Data A1 /1, 2, 3, 4, 5/ A2 /6, 7, 8, 9, 10/
      A3 = A1 + A2
      A4 = A1 * A2

      Print *, A1
      Print *, A2
      Print *, A3
      Print *, A4
      End

------------------------------------
<timberlake:Test:1:34> f95 -o arrayop arrayop.f
<timberlake:Test:1:35> ./arrayop
           1           2           3           4           5
           6           7           8           9          10
           7           9          11          13          15
           6          14          24          36          50

APL is A Programming Language specially designed to operate on arrays.

Two-dimensional arrays may be thought of as solid rectangles (rectangular arrays), or as arrays of arrays (jagged arrays). Some languages insist the programmer think of arrays one way, some the other, and some support both.
Rectangular arrays are indexed with one pair of brackets, such as a[i, j].
Jagged arrays are indexed with two pairs of brackets, such as a[i][j].

Java supports only jagged arrays:

bsh % int[][] a = new int[3][4];

bsh % print(a.length);
3

bsh % print(a[1].length);
4

Note that a is a 3-element array of 4-element arrays. It is usual to also think of this as 3 rows of 4 columns each:

bsh % for (int i=0; i<3; i++) for (int j=0; j<4; j++) a[i][j] = 10*i+j;

bsh % for (int i=0; i<3; i++) {
	for (int j=0; j<4; j++) {System.out.print(a[i][j] + " ");}
	System.out.println();}
0 1 2 3 
10 11 12 13 
20 21 22 23

An array stored so that all the elements of the first row are stored before all the elements of the second row, etc. is referred to as stored in row major order.
We can see this clearly in C:

#include <stdio.h>

int a[3][4];

int main() {
  int i,j;

  for (i=0; i<3; i++) {
    for (j=0; j<4; j++) {
      a[i][j] = 10*i + j;
    }
  }

  /* Walk the 12 ints in the order in which they are laid out in memory. */
  for (i=0; i<12; i++) {
    printf("%3d", *(&a[0][0] + i));
  }

  printf("\n");
  return 0;
}

--------------------------------------
<timberlake:Test:1:41> gcc -Wall -o arrayorder arrayorder.c
<timberlake:Test:1:42> ./arrayorder
  0  1  2  3 10 11 12 13 20 21 22 23

This shows that C stores arrays in row major order.

Let's try Fortran:

      Program arrayorder

      Integer A(3,4), B(12)
      Equivalence ( A(1,1), B(1) )

      Do 50 i = 1, 3
         Do 50 j = 1, 4
            A(i,j) = 10*i + j
 50   Continue

      Print *, B
      End

-----------------------------------------
<timberlake:Test:1:44> f95 -o arrayorder arrayorder.f
<timberlake:Test:1:45> ./arrayorder
          11          21          31          12          22          32          13          23          33          14          24          34

(Note that Equivalence is deprecated in Fortran 90 and later versions.) Fortran stores arrays in column major order. Since Fortran and C programs can easily call each other, this is an important difference.
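
The difference between the two orders comes down to how a pair of subscripts is turned into an offset from the start of the array. Here is a Python 3 sketch of both formulas. The function names are made up, and subscripts start at 0 as in C, so the stored values match the C output above but are each 11 less than the corresponding Fortran values.

```python
def row_major_offset(i, j, rows, cols):
    """Offset of element (i, j) when whole rows are stored one after another."""
    return i * cols + j


def column_major_offset(i, j, rows, cols):
    """Offset of element (i, j) when whole columns are stored one after another."""
    return j * rows + i


def layout(rows, cols, offset):
    """Store 10*i + j at the offset computed for (i, j) and return the memory."""
    memory = [None] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            memory[offset(i, j, rows, cols)] = 10 * i + j
    return memory


print(layout(3, 4, row_major_offset))
# [0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23]
print(layout(3, 4, column_major_offset))
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 3, 13, 23]
```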

Jagged arrays needn't have every row have the same number of columns.

The entire discussion of two-dimensional arrays extends to multi-dimensional arrays.

Fortran 95, Ada, Python, and Ruby allow references to a slice of an array---a more or less regular piece of an array.
Here's a small Python example:

<timberlake:Test:1:46> python
Python 2.6.4 (r264:75706, Dec 21 2009, 12:37:31) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = [0,1,2,3,4,5,6,7,8,9]
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> a[3:7]
[3, 4, 5, 6]
>>> b = [13,14,15,16]
>>> a[3:7] = b
>>> a
[0, 1, 2, 13, 14, 15, 16, 7, 8, 9]
>>>

Tuples

Erlang, Haskell, and Python have a tuple data type. Like arrays, tuples have fixed lengths, but they are immutable and heterogeneous---they may have elements of different types. Erlang's and Python's tuples are indexed by integers.
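
A quick Python 3 illustration follows. The particular tuple is an invented record holding a month's name, its length, and whether it has 31 days.

```python
# A heterogeneous tuple: a string, an integer, and a Boolean.
month = ("March", 31, True)

print(month[0], month[1])   # indexed by integers: March 31
print(len(month))           # the length is fixed: 3

try:
    month[1] = 30           # tuples are immutable
except TypeError as error:
    print("cannot change a tuple:", error)
```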

Associative Arrays

Associative arrays, also called maps in C++, hash tables in Common Lisp, Maps in Java, hashes in Perl and Ruby, and dictionaries in Python are generalizations of arrays for which the "index" can be any type. The "index" is called a key, and the element stored with the key is called the value. Here is a use of Python's dictionaries to print the length of all the months:

#! /util/bin/python

months = ("January", "February", "March", "April", "May", "June",
	   "July", "August", "September", "October", "November",
	   "December");

monLength = {"January":31, "February":28, "March":31, "April":30,
	      "May":31, "June":30, "July":31, "August":31,
	      "September":30, "October":31, "November":30,
              "December":31};

for month in months:
  print "%s has %d days." % (month, monLength[month])
-----------------------------------------------------
<timberlake:Test:1:167> python months.py
January has 31 days.
February has 28 days.
March has 31 days.
April has 30 days.
May has 31 days.
June has 30 days.
July has 31 days.
August has 31 days.
September has 30 days.
October has 31 days.
November has 30 days.
December has 31 days.

Record Types

Records, first introduced in COBOL, may be thought of as primitive object classes:

They have instance variables.
They have set and get methods.
They have no other methods.
They do not support inheritance.

C and C++ call them structs. Common Lisp calls them structures. Note that C++ and Common Lisp have true, modern, objects as well. See Sebesta for more details.

Union Types

A union type is a semi-organized way to allow a variable to hold values of different types at different times, even though the language is statically typed. Unions are not very safe. See Sebesta.

Pointer and Reference Types

The set of data objects in the pointer type is the set of memory addresses plus nil, which is an explicitly invalid address. That is, the value bound to a variable whose type is a pointer type is either a memory address or nil.

It is most common for the value of a pointer variable to be the address of a memory cell in the heap, but C and C++ also allow addresses elsewhere in RAM (for example, of statically allocated variables) or on the stack.

Fortran 77 (and earlier) does not have pointer types, but they can be simulated by using one array for data and a separate array of indices into the first array as the pointers.
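
That simulation can be sketched as a linked list kept in two parallel arrays. The sketch is written in Python 3 rather than Fortran, and the names data, next_index, and NIL are assumptions for the example.

```python
NIL = -1   # plays the role of the explicitly invalid address

# data[k] is the value in cell k; next_index[k] "points" to the following cell.
data       = [30, 10, 20, 0, 0]
next_index = [NIL, 2, 0, NIL, NIL]
head = 1   # subscript of the first cell of the list


def traverse(start):
    """Follow the simulated pointers and collect the values in list order."""
    values = []
    cell = start
    while cell != NIL:
        values.append(data[cell])
        cell = next_index[cell]
    return values


print(traverse(head))   # [10, 20, 30]
```

The subscripts stored in next_index behave like addresses, but only within the data array.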

How can a pointer variable contain an address in RAM or on the stack? Addresses in RAM or on the stack are allocated when variables are declared. If ptr is a pointer variable, we want: ptr := <expression>, but <expression> would be evaluated for its r-value. So we need something that says "evaluate this expression for its l-value." In C and C++ that operator is &, and its operand must be an expression that could be on the left-hand side of an assignment statement.

In statically typed languages, the declaration of a pointer variable must include the type of variable it points to.

If x is a variable and ptr is a pointer variable, what is the meaning of x := ptr?

If x is also a pointer variable, it's a simple assignment statement.
If x is not a pointer variable, it's either an error or the compiler must know that ptr is to be dereferenced. C and C++ use * as an explicit dereferencing operator. Fortran 95 does implicit dereferencing.

Here's a C program using a pointer whose value is an address in the stack:

#include <stdio.h>
int* ptr;

void sub1() {
  int x, y;
  x = 3;
  ptr = &x;
  y = *ptr;
  printf("x = %2d; y = %2d.\n", x, y);
}

void sub2() {
  int z = 5;
  printf("z = %2d.\n", z);
}

void sub3() {
  printf("*ptr = %2d.\n", *ptr);
}

int main() {
  sub1();
  sub2();
  sub3();
  return 0;
}

--------------------------------------------
<timberlake:Test:1:27> gcc -Wall -o pointerTest pointerTest.c 
<timberlake:Test:1:28> ./pointerTest
x =  3; y =  3.
z =  5.
*ptr =  5.

Notice that ptr contains a pointer to the memory cell on the stack that was first occupied by sub1's x, but later was occupied by sub2's z.

Here is an example in Fortran 95, showing implicit dereferencing:

      Program pointerTest

      Integer, Pointer :: ptr
      Integer, Target :: x
      Integer :: y

      x = 3
      ptr => x
      y = ptr
      x = 5
      Print *, "x = ", x, "y = ", y
      End

-----------------------------------------------------
<timberlake:Test:1:29> f95 -o pointerTest pointerTest.f
<timberlake:Test:1:30> ./pointerTest
 x =            5 y =            3

The fact that y has the value 3 shows that ptr was dereferenced before a value was stored into y.

Pointer arithmetic is allowed in C and C++. If ptr is of type typ *, and i is of type int, the expression ptr + i evaluates to the address i*sizeof(typ) bytes beyond ptr.
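
The address calculation can be checked from Python 3 with the standard ctypes module, which exposes C data types and raw addresses. This is only a sketch of the arithmetic, not a recommendation to program this way; value_at is a name invented here.

```python
import ctypes

IntArray5 = ctypes.c_int * 5          # the C type int[5]
arr = IntArray5(10, 20, 30, 40, 50)

base = ctypes.addressof(arr)          # address of arr[0], playing the role of ptr
size = ctypes.sizeof(ctypes.c_int)    # sizeof(typ)


def value_at(i):
    """Read the int stored i elements beyond base, like *(ptr + i) in C."""
    address = base + i * size
    return ctypes.c_int.from_address(address).value


print([value_at(i) for i in range(5)])   # [10, 20, 30, 40, 50]
```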

In C and C++, an array name is a constant pointer to the first element of the array, so subscripting is done by pointer arithmetic, and pointer expressions may replace subscripted arrays.

Anonymous variables on the heap are manipulated via pointers. The allocation operators new, in Java and C++, and malloc(size), in C, return pointers to the newly allocated heap memory.

Many novice C programmers find pointers to be confusing, but "if everything is a pointer, you don't have to think about pointers," and that is the approach taken by Erlang, Haskell, Lisp, Java, Prolog, Python, and Ruby. In those languages, you can think you are storing an object (or, at worst, a reference to an object) in a variable. You just have to remember that a change made via one reference variable may be seen via another reference variable.
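
Here is that last point in Python 3, as a small sketch. The variable names are arbitrary.

```python
a = [1, 2, 3]
b = a            # b refers to the same list object as a
b.append(4)      # a change made via b ...
print(a)         # ... is seen via a: [1, 2, 3, 4]

c = list(a)      # c refers to a new list with the same elements
c.append(5)
print(a)         # a is unaffected: [1, 2, 3, 4]

print(a is b, a is c)   # True False
```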

The dangling pointer problem is the problem of a pointer variable, in scope and during its lifetime, pointing to a memory cell that was already deallocated, perhaps via another pointer variable (and possibly even reused).

This C program shows that a pointer may be mistakenly used, even though the space it points to has been deallocated:

#include <stdio.h>
#include <stdlib.h>

int* ptr;

int main() {
  ptr = malloc(sizeof(int));
  *ptr = 3;
  free(ptr);
  printf("*ptr = %2d\n", *ptr);
  return 0;
}

---------------------------------------------------
<pollux:Test:1:27> gcc -Wall -o danglingTest danglingTest.c 
 ./danglingTest 
*ptr =  3

(timberlake printed *ptr = 0.)

The dangling pointer problem is commonly solved by taking explicit deallocation away from the programmer, and using automatic garbage collection instead.

The problem of memory leakage is the problem of memory cells allocated on the heap becoming unreachable (becoming garbage) when the pointer variables referring to them end their lifetime or get reassigned to other heap memory. This problem is also solved by automatic garbage collection.

Lists

A list is a recursive collection that is either

empty;
or consists of a head (or first) element of any data type and a tail (or rest) part which is also a list.

The operation of constructing a list with some element as a head and some list as a tail is usually called cons. In Lisp:

cl-user(9): (cons 'a '(b c d e f))
(a b c d e f)

Lists are a native data type in Erlang, Haskell, Lisp, Prolog, and Python.
Lists are immutable in Erlang, Haskell, and Prolog;
mutable in Lisp and Python;
homogeneous in Haskell;
heterogeneous in Erlang, Lisp, Prolog, and Python.
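
Python's built-in list is not actually built from head/tail pairs, but the recursive definition is easy to imitate. In this Python 3 sketch, a list is either EMPTY or a pair made by cons; all the names are chosen for the example.

```python
EMPTY = None   # the empty list


def cons(head, tail):
    """Construct a list from a head element and a tail list."""
    return (head, tail)


def first(lst):
    return lst[0]


def rest(lst):
    return lst[1]


def from_python_list(items):
    """Build a cons list holding the elements of items in the same order."""
    lst = EMPTY
    for item in reversed(items):
        lst = cons(item, lst)
    return lst


def to_python_list(lst):
    """Collect the elements of a cons list into a Python list."""
    items = []
    while lst is not EMPTY:
        items.append(first(lst))
        lst = rest(lst)
    return items


tail = from_python_list(["b", "c", "d", "e", "f"])
print(to_python_list(cons("a", tail)))   # ['a', 'b', 'c', 'd', 'e', 'f']
```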

Functions

Based on the lambda calculus of logician Alonzo Church, Lisp can express a function as (lambda (x y) (= (mod x y) 0)).

cl-user(6): (type-of (lambda (x y) (= (mod x y) 0)))
function

A lambda expression, being a function, can be the first element of a list, with the following elements being its arguments.

cl-user(13): ((lambda (x y) (= (mod x y) 0)) 81 3)
t

Lisp's apply is a function that takes a function and a list of arguments, and applies the function to the arguments.

cl-user(14): (apply (lambda (x y) (= (mod x y) 0)) '(48 6))
t

Lisp's funcall is a function that takes a function and a sequence of arguments, and applies the function to the arguments.

cl-user(33): (funcall (lambda (x y) (= (mod x y) 0)) 48 6)
t

When you define a function in Lisp, the function is put in the symbol-function cell of the name of the function.

cl-user(15): (defun fact (n) (if (< n 2) 1 (* n (fact (1- n)))))
fact
cl-user(16): (fact 4)
24
cl-user(17): (symbol-function 'fact)
#<Interpreted Function fact>
cl-user(18): (compile 'fact)
fact
nil
nil
cl-user(19): (symbol-function 'fact)
#<Function fact>
cl-user(34): (funcall (symbol-function 'fact) 4)
24

A function can compute a function:

cl-user(35): ((lambda (x) (lambda (y) (= (mod x y) 0))) 312)
#<Interpreted Closure (:internal (:internal nil)) @ #x71c22642>
cl-user(37): (type-of ((lambda (x) (lambda (y) (= (mod x y) 0))) 312))
function

If that's a function, we can use funcall on it:

cl-user(40): (funcall ((lambda (x) (lambda (y) (= (mod x y) 0))) 312) 3)
t

Representing a function of two arguments as a function of one argument whose value is a function of one argument (and similarly for more than two arguments) is called "currying", after the logician Haskell Curry. The type of a function from type t₁ to type t₂ can be expressed as t₁ -> t₂. So t₁ -> (t₂ -> t₃) is the type of a function from type t₁ to a function of type t₂ -> t₃, which is the curried form of a function of two arguments, one of type t₁ and one of type t₂, to a result of type t₃.
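
Currying can be written out by hand in any language with first-class functions. Here is a Python 3 sketch; curry2 is a helper invented for the example, and divby mirrors the divisibility test used in the Lisp examples above.

```python
def divby(x, y):
    """An ordinary function of two arguments."""
    return x % y == 0


def curry2(f):
    """Return the curried form of a two-argument function f."""
    return lambda x: lambda y: f(x, y)


curried = curry2(divby)
print(curried(312)(3))        # True

divides_312 = curried(312)    # a function of one argument
print(divides_312(5))         # False
```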

Haskell represents its functions in the curried form:

<timberlake:Test:1:35> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.

Prelude> let divby x y = ((mod x y) == 0)

Prelude> divby 14 3
False

Prelude> divby 4551 3
True

Prelude> :t divby
divby :: (Integral a) => a -> a -> Bool

Prelude> ^d
Leaving GHCi.

There is a function data type in Erlang, Haskell, Lisp, and Python.
There are lambda expressions in Ruby, but their class is Proc.

Last modified: Wed Feb 10 13:21:43 2010 Stuart C. Shapiro <shapiro@cse.buffalo.edu>
