Character String Types
A character string type is one in which the values consist of sequences of characters. Character string constants are used to label output, and the input and output of all kinds of data are often done in terms of strings. Of course, character strings also are an essential type for all programs that do character manipulation.
The two most important design issues that are specific to character string types are the following:
• Should strings be simply a special kind of character array or a primitive type?
• Should strings have static or dynamic length?
Strings and Their Operations
The most common string operations are assignment, catenation, substring reference, comparison, and pattern matching.
A substring reference is a reference to a substring of a given string. Substring references are discussed in the more general context of arrays, where the substring references are called slices.
In general, both assignment and comparison operations on character strings are complicated by the possibility of string operands of different lengths. For example, what happens when a longer string is assigned to a shorter string, or vice versa? Usually, simple and sensible choices are made for these situations, although programmers often have trouble remembering them.
Pattern matching is another fundamental character string operation. In some languages, pattern matching is supported directly in the language. In others, it is provided by a function or class library.
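As a rough illustration of the five operations named above, the following sketch uses std::string and std::regex from the C++ standard library. The struct OpResults, the function demo_operations, and the sample values are invented for this example and do not come from the text. It also shows one common choice for operands of different lengths: assignment resizes the target, and a string that is a proper prefix of another compares as less.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the results of the five common operations.
struct OpResults {
    std::string assigned;
    std::string catenated;
    std::string substring;
    bool shorter_is_less;
    bool pattern_found;
};

// Hypothetical demonstration function; all values are illustrative.
OpResults demo_operations() {
    std::string fruit = "apple";
    std::string longer = "applesauce";

    OpResults r;
    // Assignment: the target simply takes on the length of the source.
    r.assigned = fruit;
    r.assigned = longer;
    // Catenation with the + operator.
    r.catenated = fruit + " pie";
    // Substring reference: 5 characters starting at position 5.
    r.substring = longer.substr(5, 5);
    // Comparison: "apple" is a proper prefix of "applesauce", so it is less.
    r.shorter_is_less = fruit < longer;
    // Pattern matching through the library, not the language itself.
    r.pattern_found = std::regex_search(longer, std::regex("s[a-z]+e$"));
    return r;
}
```

Here the substring is "sauce" and both Boolean fields come out true. Other languages make different choices for mismatched lengths, so the behavior shown is specific to std::string.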
If strings are not defined as a primitive type, string data is usually stored in arrays of single characters and referenced as such in the language. This is the approach taken by C and C++, which use char arrays to store character strings. These languages provide a collection of string operations through standard libraries. Many uses of strings and many of the library functions use the convention that character strings are terminated with a special character, null, which is represented with zero. This is an alternative to maintaining the length of string variables. The library operations simply carry out their operations until the null character appears in the string being operated on. Library functions that produce strings often supply the null character. The character string literals that are built by the compiler also have the null character. For example, consider the following declaration:
    char str[] = "apples";
In this example, str is an array of char elements, specifically apples\0, where \0 is the null character.
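The effect of the terminating null can be checked directly. The sketch below assumes the declaration is char str[] = "apples"; as the description implies, and the three function names are invented for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Size of the array in bytes: six letters plus the null the compiler appends.
std::size_t array_size() {
    char str[] = "apples";
    return sizeof str;
}

// strlen counts characters up to, but not including, the null character.
std::size_t visible_length() {
    char str[] = "apples";
    return std::strlen(str);
}

// The element just past the last letter holds the null character.
bool ends_with_null() {
    char str[] = "apples";
    return str[6] == '\0';
}
```

The array occupies seven elements while strlen reports six, which is the difference between the storage a string needs and its logical length.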
Some of the most commonly used library functions for character strings in C and C++ are strcpy, which copies strings; strcat, which catenates one given string onto another; strcmp, which lexicographically compares (by the order of their character codes) two given strings; and strlen, which returns the number of characters, not counting the null, in the given string. The parameters and return values for most of the string manipulation functions are char pointers that point to arrays of char. Parameters can also be string literals.
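A minimal sketch of these library functions in use follows. The helper names build_phrase and compare_sign, the buffer size, and the sample strings are assumptions made for the example. The buffer is deliberately sized to hold the whole result plus the null character.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: builds "apples and pears" with strcpy and strcat.
std::string build_phrase() {
    char buffer[32];                    // room for 16 characters plus the null
    std::strcpy(buffer, "apples");      // copies the literal, including its null
    std::strcat(buffer, " and pears");  // appends, starting at the old null
    return std::string(buffer);
}

// Hypothetical helper: reduces strcmp's result to -1, 0, or 1.
// strcmp only guarantees the sign of its result, not a specific value.
int compare_sign(const char* a, const char* b) {
    int result = std::strcmp(a, b);
    return (result > 0) - (result < 0);
}
```

Note that compare_sign("apple", "apples") is -1: the comparison reaches the null at the end of the shorter string, and null (zero) is less than any other character code.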
The string manipulation functions of the C standard library, which are also available in C++, are inherently unsafe and have led to numerous programming errors. The problem is that the functions in this library that move string data do not guard against overflowing the destination. For example, consider the following call to strcpy:
    strcpy(dest, src);
If the length of dest is 20 and the length of src is 50, strcpy will write over the 30 bytes that follow dest. The point is that strcpy does not know the length of dest, so it cannot ensure that the memory following it will not be overwritten. The same problem can occur with several of the other functions in the C string library.
In addition to C-style strings, C++ also supports strings through its standard class library, which is similar to that of Java. Because of the insecurities of the C string library, C++ programmers should use the string class from the standard library, rather than char arrays and the C string library.
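Two ways of avoiding the overflow are sketched below. Neither is prescribed by the text: bounded_copy is a hypothetical helper that is told the destination's capacity and refuses a copy that would not fit, and copy_with_string relies on std::string to track its own length and grow as needed.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies src into dest only if it fits, null included.
// On failure, dest is left as an empty string and false is returned.
bool bounded_copy(char* dest, std::size_t dest_size, const char* src) {
    if (dest_size == 0) {
        return false;                    // no room even for the null character
    }
    std::size_t needed = std::strlen(src) + 1;  // characters plus the null
    if (needed > dest_size) {
        dest[0] = '\0';
        return false;
    }
    std::memcpy(dest, src, needed);
    return true;
}

// Hypothetical helper: std::string reallocates on assignment, so a longer
// source cannot overwrite memory that follows the destination.
std::string copy_with_string(const std::string& src) {
    std::string dest = "short";
    dest = src;
    return dest;
}
```

With a 20-element destination and a 50-character source, bounded_copy returns false and writes nothing beyond the first element, whereas the std::string version simply produces a 50-character result.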
In Java, strings are supported by the String class, whose values are constant strings, and the StringBuffer class, whose values are changeable and are more like arrays of single characters. These values are specified with methods of the StringBuffer class. C# and Ruby include string classes that are similar to those of Java.
Python includes strings as a primitive type and has operations for substring reference, catenation, indexing to access individual characters, as well as methods for searching and replacement. There is also an operation for character membership in a string. So, even though Python’s strings are primitive types, for character and substring references, they act very much like arrays of characters. However, Python strings are immutable, similar to the String class objects of Java.
In F#, strings are a class. Individual characters, which are represented in Unicode UTF-16, can be accessed, but not changed. Strings can be catenated with the + operator. In ML, string is a primitive immutable type. It uses ^ for its catenation operator and includes functions for substring referencing and getting the size of a string.