UNIVERSITY AT BUFFALO, THE STATE UNIVERSITY OF NEW YORK
The Department of Computer Science & Engineering

STUART C. SHAPIRO: CSE 305

CSE 305
Programming Languages
Lecture Notes
Stuart C. Shapiro

Character Sets

40 character BCD: &-0123456789ABCDEFGHIJKLMNOPQR/STUVWXYZ

FORTRAN II 48 character set: 0123456789 ABCDEFGHIJKLMNOPQRSTUVWXYZ =+-*/().,$' blank

The 128-character ASCII (American Standard Code for Information Interchange) character set has 32 non-printing control characters and 96 others, counting blank and DEL:
blank !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ DEL

Unicode character set contains characters for most languages.

Some Character Categories

<digit> -> 0|1|2|3|4|5|6|7|8|9
<lcletter> -> a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
<ucletter> -> A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
<underscore> -> _
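
As a minimal sketch, the four categories above can be written as sets of ASCII characters. The names `DIGIT`, `LCLETTER`, `UCLETTER`, `UNDERSCORE`, and `category` are chosen here for illustration and are not part of any language standard.

```python
import string

# The four character categories from the productions above.
DIGIT = set(string.digits)
LCLETTER = set(string.ascii_lowercase)
UCLETTER = set(string.ascii_uppercase)
UNDERSCORE = {"_"}

def category(ch):
    """Return the name of the category that contains ch, or None."""
    if ch in DIGIT:
        return "digit"
    if ch in LCLETTER:
        return "lcletter"
    if ch in UCLETTER:
        return "ucletter"
    if ch in UNDERSCORE:
        return "underscore"
    return None  # ch belongs to none of the four categories

print(category("7"))  # digit
print(category("+"))  # None
```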

Names

Names, or identifiers, are used for variables, subprograms (or methods), types, classes, etc.

Before discussing what a name looks like, we need to discuss what aren't names:

Comments

Fortran77, which is line oriented, uses a C or * in column 1 to indicate that the line is a comment.

Other languages typically have one comment symbol to indicate that the rest of the line is a comment, and a pair of brackets to indicate that the enclosing material is a comment. For example,

Language	Rest of Line Comment	Open Comment	Close Comment
bash	#	none	none
C	//	/*	*/
C++	//	/*	*/
C#	//	/*	*/
Common Lisp	;	#|	|#
Erlang	%	none	none
Fortran90,95	!	none	none
Haskell	--	{-	-}
Java	//	/*	*/
Perl	#	=	=cut
Prolog	%	/*	*/
Python	#	none	none
Ruby	#	=begin	=end

(Perl's comment brackets must be at the beginning of a line where a statement would be legal.)

There are also common conventions, to which IDEs and tools are sometimes sensitive. For example, in Java, the open comment bracket /** begins a Javadoc comment. And in Lisp,

;;; is used at the beginning of a line, for comments that are outside any function definition;
;; is used at the beginning of an indented line (indented like other lines of code), for comments within a function definition;
; is used after, but on the same line as, normal code, to comment on that line of code.
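
The "rest of line" comment symbols listed for each language can be put to work in a deliberately naive sketch. The names `LINE_COMMENT` and `strip_line_comment` are hypothetical. The sketch assumes the comment symbol never occurs inside a string literal, which a real scanner could not assume.

```python
# Rest-of-line comment symbols for the languages discussed above.
LINE_COMMENT = {
    "bash": "#", "C": "//", "C++": "//", "C#": "//", "Common Lisp": ";",
    "Erlang": "%", "Fortran90,95": "!", "Haskell": "--", "Java": "//",
    "Perl": "#", "Prolog": "%", "Python": "#", "Ruby": "#",
}

def strip_line_comment(line, language):
    """Naively drop everything from the first comment symbol onward.

    Assumption: the symbol does not appear inside a string literal.
    """
    symbol = LINE_COMMENT[language]
    code, _, _ = line.partition(symbol)
    return code.rstrip()

print(strip_line_comment("x = 3  # set x", "Python"))           # x = 3
print(strip_line_comment("(+ x y) ; add them", "Common Lisp"))  # (+ x y)
```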

Line Boundaries

Many programming languages ignore line boundaries, treating them as whitespace, except for the "rest of line" comment. They include C, C++, Java, Common Lisp, Perl, and Prolog.

Bash, Fortran, Haskell, Python, and Ruby consider the end of a line to be a statement terminator. Some of these have a way to explicitly indicate continuation onto the next line, and a way to indicate that several statements occur on one line.

Whitespace

Whitespace includes spaces, and other characters that act as separators, such as newlines and tabs.

Fixed-form Fortran (Fortran 77 and earlier) ignores spaces (blanks) outside character constants. For example, in the Do statement,

Do 50 n = 1, 9999

if the comma is omitted, the statement will be interpreted as the assignment statement

Do50n = 19999
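
The effect can be imitated with a small sketch that deletes every blank before the statement is analyzed. The function name `strip_blanks` is made up for this illustration, and the sketch ignores character constants, where blanks are significant.

```python
def strip_blanks(statement):
    """Remove every blank, roughly as a fixed-form Fortran compiler does.

    Assumption: the statement contains no character constants.
    """
    return statement.replace(" ", "")

print(strip_blanks("Do 50 n = 1, 9999"))  # Do50n=1,9999
print(strip_blanks("Do 50 n = 1 9999"))   # Do50n=19999
```

With the comma present, the text after the = cannot be a single expression, so the statement is read as a Do loop. Without it, the line has exactly the shape of an assignment to a variable named Do50n.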

Bash uses spaces as separators, especially between a command and its arguments:

bash-2.02$ x=3

bash-2.02$ echo $x
3

bash-2.02$ x = 3
bash: x: command not found

Haskell and Python use indentation at the beginning of the line to indicate a block. This is an example in Python:

if expr:
   print("Block line 1.")
   print("Block line 2.")
else:
   print("In else block.")
print("Out of Block")

Tokens

A token is "a sequence of characters with a unit of meaning." [Wall, Christiansen & Orwant, Programming Perl, p. 49]

Better: A token is a terminal symbol of the programming language that the reader (parser) passes to the compiler or interpreter.

Punctuation

Punctuation, sometimes called separators, are non-whitespace characters that separate other tokens. They may include parentheses, brackets, and semicolons. For example in the expression a[i], the brackets separate the tokens a and i and prevent the expression from looking like the identifier ai.

In Python, the commas in [1,2,3] separate the elements of the list.

In bash, punctuation marks are called metacharacters:

metacharacter
A character that, when unquoted, separates words. One of the following:
| & ; ( ) < > space tab

[bash man page]

Operators

Operators include the numeric, relational, boolean, and other operators of the language. For example, the 37 operators of Java are

=	>	<	!	~	?	:
==	<=	>=	!=	&&	||	++	--
+	-	*	/	&	|	^	%
<<	>>	>>>
+=	-=	*=	/=	&=	|=	^=	%=
<<=	>>=	>>>=

Operators usually separate other tokens. For example, in Java, x+y is the same as x + y. In Lisp, however, most of these symbols are ordinary characters, so that while

(+
x y)

is an expression that evaluates to the sum of x and y, (+xy) is a call to the function of no arguments whose name is +xy, and 3+5 is a variable.
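
The contrast can be sketched with two toy tokenizers. Both are simplifications written for this illustration, not the actual lexical rules of Java or Common Lisp, and the names `java_like_tokens` and `lisp_like_tokens` are hypothetical. The first treats a few operator characters as tokens in their own right. The second breaks the input only at whitespace and parentheses.

```python
import re

# Hypothetical token pattern for a Java-like language: an identifier,
# an integer literal, or one of a few single-character operators.
JAVA_LIKE_TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|\d+|[-+*/=<>()])")

def java_like_tokens(text):
    """Split text into tokens; operators separate tokens even without spaces."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = JAVA_LIKE_TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def lisp_like_tokens(text):
    """Split text at whitespace and parentheses only."""
    return re.findall(r"[()]|[^\s()]+", text)

print(java_like_tokens("x+y"))      # ['x', '+', 'y']
print(java_like_tokens("x + y"))    # ['x', '+', 'y']
print(lisp_like_tokens("(+ x y)"))  # ['(', '+', 'x', 'y', ')']
print(lisp_like_tokens("(+xy)"))    # ['(', '+xy', ')']
print(lisp_like_tokens("3+5"))      # ['3+5']
```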

Numbers and other literals

Literals are tokens that the compiler recognizes as particular data values. They include numbers such as 5 and 78.34, but many languages have literals of other types, such as the Java boolean literals true and false, and Java's null.

There is generally an involved syntax for numeric literals, including optional signs, decimal points, exponentiation marks, and radix indicators. For example, in C++ and Java 0x57 is a hexadecimal integer equal to the decimal integer 87, and in Lisp, -3745e-2 is a floating point number equal to -37.45. Lisp also has literals of a ratio type, such as 3/5.
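
The values quoted above can be checked with a short sketch. Python is used only as a calculator here: its `int`, `float`, and `fractions.Fraction` stand in for the C++, Java, and Lisp readers, whose full literal syntax differs in detail.

```python
from fractions import Fraction

hex_value = int("0x57", 16)       # radix indicator 0x: 5*16 + 7
float_value = float("-3745e-2")   # sign, digits, and exponent mark e
ratio = Fraction(3, 5)            # stands in for the Lisp ratio 3/5

print(hex_value)    # 87
print(float_value)  # -37.45
print(ratio)        # 3/5
```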

Names (Identifiers)

Names, or identifiers, are used for variables, subprograms (or methods), types, classes, etc. Different languages have different rules for the formation of identifiers. In Java,

"An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter. An identifier cannot have the same spelling (Unicode character sequence) as a keyword, boolean literal, or the null literal." [The Java Standard, Section 3.8.]

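So for Java, restricting attention to the character categories above (Java letters also include $ and letters from other Unicode scripts),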

<javaletter> -> <lcletter> | <ucletter> | <underscore>
<identifier> -> <javaletter> {<javaletter> | <digit>}
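
The two productions translate directly into a regular expression. The following is a sketch under the same ASCII-only assumption as the grammar. `RESERVED` is a small illustrative subset of the excluded spellings, not Java's full keyword list, and `is_java_identifier` is a name invented for this example.

```python
import re

# <javaletter> {<javaletter> | <digit>}, ASCII only, as in the grammar above.
JAVA_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

# Illustrative subset only: a few keywords, the boolean literals, and null.
RESERVED = {"class", "if", "while", "true", "false", "null"}

def is_java_identifier(text):
    """True if text matches the grammar and is not an excluded spelling."""
    return JAVA_IDENTIFIER.fullmatch(text) is not None and text not in RESERVED

print(is_java_identifier("HashTable"))  # True
print(is_java_identifier("1x"))         # False
print(is_java_identifier("null"))       # False
```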

In Fortran77, a name may only be 1-6 letters and/or digits, the first of which must be a letter. Fortran90 allows names up to 31 characters long, and allows them to include the _ character. "Some processors will allow lower-case as well as upper-case alphabetic characters in names and programs; in such cases, FORTRAN considers the lower-case letter equivalent to its upper-case correspondent" [S. L. Edgar, Fortran for the '90s, W.H. Freeman & Company, 1992, p. 51].
So for Fortran77,

<name> -> <ucletter> {<ucletter> | <digit>}⁵

The default type of a variable whose name begins with I, J, K, L, M, or N is integer; otherwise, it is real. (The default may be overridden by an explicit type declaration.)
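
Both the name rule and the default typing rule are easy to sketch. The function names `is_f77_name` and `default_type` are hypothetical, and the sketch considers upper-case names only, as the grammar does.

```python
import re

# One upper-case letter followed by at most five letters or digits.
F77_NAME = re.compile(r"[A-Z][A-Z0-9]{0,5}")

def is_f77_name(text):
    """True if text is a legal Fortran77 name under the grammar above."""
    return F77_NAME.fullmatch(text) is not None

def default_type(name):
    """Return the implicit type of a variable with no explicit declaration."""
    if not is_f77_name(name):
        raise ValueError(f"not a Fortran77 name: {name!r}")
    return "integer" if name[0] in "IJKLMN" else "real"

print(is_f77_name("DO50N"))    # True
print(is_f77_name("COUNTER"))  # False: seven characters
print(default_type("N"))       # integer
print(default_type("X1"))      # real
```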

Common Lisp allows names (symbols) to be of arbitrary length, and treats as a name any token that cannot be interpreted as a number. (See the Lisp Hyperspec Section 2.2 Reader Algorithm and Section 2.3.4 Symbols as Tokens) So Lisp names include

1+     /5     ^/-     734ff     89..93

Also, the symbols that are operators in other languages, such as + and >, are names in Common Lisp. In fact, Common Lisp treats any character preceded by the escape character \ as an alphabetic character. So the following are also Lisp names:

ab\(c     quo\"te

and even several\ words\ strung\ together, which includes internal spaces. Even the newline character may be included in a Lisp name if preceded by an escape character.

Common Lisp also includes escape brackets:

|several words strung together|

is the same name as

several\ words\ strung\ together

Common Lisp macro characters, when encountered by the reader, cause the reader to call a function that recursively reads the input file, and returns an object as if the reader read that in the first place.

Moreover, Common Lisp puts the attributes of characters in the control of the programmer. For example, the programmer could make ( and ) be considered simple alphabetic characters, and make [ and ] serve the role ( and ) normally do.

Languages also differ about the significance of upper- and lower-case letters. Most modern languages distinguish between them. So HashTable is a different name from Hashtable.

Erlang and Prolog consider a name that starts with an upper-case letter to be a variable, while one that begins with a lower-case letter is considered to be a literal symbol.

In Haskell, a variable must start with a lower-case letter, and then can have mixed lower-case letters, upper-case letters, digits, and single quote marks. An underscore is considered a lower-case letter.

<digit> -> 0|1|2|3|4|5|6|7|8|9
<small> -> _|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
<large> -> A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
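<varid> -> (<small> {<small> | <large> | <digit> | ' }) - <reservedid>
(Here - denotes difference: a <varid> is any string of the form on the left that is not a reserved identifier.)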
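
The production for <varid> can be sketched the same way as a regular expression plus an exclusion set. `RESERVED_IDS` lists only some of Haskell's reserved identifiers, for illustration, and `is_varid` is a name invented here. Like the grammar above, the sketch is restricted to ASCII.

```python
import re

# <small> {<small> | <large> | <digit> | '}, with _ counted as <small>.
VARID_SHAPE = re.compile(r"[a-z_][A-Za-z0-9_']*")

# Illustrative subset of the reserved identifiers, not the complete list.
RESERVED_IDS = {"case", "class", "data", "do", "else", "if", "in",
                "let", "module", "of", "then", "type", "where", "_"}

def is_varid(text):
    """True if text has the shape of a variable and is not reserved."""
    return VARID_SHAPE.fullmatch(text) is not None and text not in RESERVED_IDS

print(is_varid("x'"))   # True
print(is_varid("Foo"))  # False: starts with an upper-case letter
print(is_varid("let"))  # False: reserved
```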

In Perl, every variable name must start with a "funny character". The name of a scalar variable, such as one that stores a number or string, must start with a $, such as $x. The name of a variable whose value is an array must start with an @, such as @monthTable. The name of a variable whose value is a hash table, called simply a "hash", must start with a %, such as %addressBook.

The first character of a Ruby variable indicates its scope.

ANSI Common Lisp and versions of ACL (Allegro Common Lisp) before version 6 differentiate upper-case from lower-case letters, but automatically convert non-escaped lower-case letters to upper case.

Although Emacs-Lisp is not a version of Common Lisp, like current ACL, it differentiates upper- from lower-case letters, and does not change either to the other.
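
The escape character, the escape brackets, and the two treatments of case can be combined in one rough sketch. It is a simplification written for this illustration, not the Common Lisp reader algorithm: `symbol_name` is a hypothetical function, and its `upcase` flag is an assumption that switches between a reader that upper-cases non-escaped letters and one that leaves case alone.

```python
def symbol_name(token, upcase=False):
    """Return the name spelled by a symbol token.

    A backslash makes the next character an ordinary name character.
    A pair of vertical bars does the same for everything between them.
    With upcase=True, letters that are not escaped are upper-cased.
    """
    name = []
    in_bars = False
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\":
            i += 1
            if i >= len(token):
                raise ValueError("escape character at end of token")
            name.append(token[i])          # escaped: kept exactly as written
        elif ch == "|":
            in_bars = not in_bars          # bracket itself is not in the name
        elif in_bars:
            name.append(ch)                # inside brackets: kept as written
        else:
            name.append(ch.upper() if upcase else ch)
        i += 1
    if in_bars:
        raise ValueError("unterminated escape bracket")
    return "".join(name)

print(symbol_name(r"several\ words"))           # several words
print(symbol_name("|several words|"))           # several words
print(symbol_name("HashTable", upcase=True))    # HASHTABLE
print(symbol_name("|HashTable|", upcase=True))  # HashTable
```

With `upcase=True`, HashTable and Hashtable spell the same name unless their letters are escaped, which is the behavior described for ANSI Common Lisp.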

Keywords and Reserved Words

Keywords and reserved words are tokens that look like identifiers, but whose use is restricted. Sebesta distinguishes them by saying that a keyword is restricted in only certain contexts, whereas a reserved word may never be used as an identifier. However, what Java calls keywords would be reserved words by this definition. When starting with a new programming language, finding the list of keywords and reserved words and what their restrictions are is as important as finding out what the comment symbols are.

The bash man page says,

Reserved words
are words that have a special meaning to the shell. The following words are recognized as reserved when unquoted and either the first word of a simple command ... or the third word of a case or for command:
! case do done elif else esac fi for function if in select then until while { } time [[ ]]
...
Note that unlike the metacharacters ( and ), { and } are reserved words ... Since they do not cause a word break, they must be separated from [other words] by whitespace.


Last modified: Thu Jan 28 11:55:34 EST 2010 Stuart C. Shapiro <shapiro@cse.buffalo.edu>
