A distributed operating system is the logical aggregation of operating system software over a collection of independent, networked, communicating, and physically separate computational nodes. Individual nodes each hold a specific software subset of the global aggregate operating system. Each subset is a composite of two distinct service provisions. The first is a ubiquitous minimal kernel, or microkernel, that directly controls that node’s hardware. Second is a higher-level collection of system management components that coordinate the node's individual and collaborative activities. These components abstract microkernel functions and support user applications.

A distributed operating system (OS) is a generalization of a traditional operating system. A distributed OS provides the essential services and functionality required of an OS, and adds attributes and particular configurations that allow it to support additional requirements such as increased scale and availability. To a user, a distributed OS works in a manner similar to a single-node, monolithic operating system. That is, although it consists of multiple nodes, it appears to users and applications as a single node.
The kernel

At each locale (typically a node), the kernel provides a minimally complete set of node-level utilities necessary for operating a node’s underlying hardware and resources. These mechanisms include allocation, management, and disposition of a node’s resources, processes, communication, and input/output management support functions. Within the kernel, the communications subsystem is of foremost importance for a distributed OS.
In a distributed OS, the kernel often supports a minimal set of functions, including low-level address space management, thread management, and inter-process communication (IPC). A kernel of this design is referred to as a microkernel. Its modular nature enhances reliability and security, essential features for a distributed OS. It is common for a kernel to be identically replicated over all nodes in a system, which in turn implies that the nodes use similar hardware. The combination of minimal design and ubiquitous node coverage enhances the global system's extensibility and the ability to dynamically introduce new nodes or services.

System management components

System management components are software processes that define the node's policies. These components are the part of the OS outside the kernel. These components provide higher-level communication, process and resource management, reliability, performance and security. The components match the functions of a single-entity system, adding the transparency required in a distributed environment.
The distributed nature of the OS requires additional services to support a node's responsibilities to the global system. In addition, the system management components accept the "defensive" responsibilities of reliability, availability, and persistence. These responsibilities can conflict with each other. A consistent approach, balanced perspective, and a deep understanding of the overall system can assist in identifying diminishing returns. Separation of policy and mechanism mitigates such conflicts.
Working together as an operating system

The architecture and design of a distributed operating system must realize both individual node and global system goals. Architecture and design must be approached in a manner consistent with separating policy and mechanism. In doing so, a distributed operating system attempts to provide an efficient and reliable distributed computing framework that keeps user awareness of the underlying command and control efforts to an absolute minimum.
The multi-level collaboration between a kernel and the system management components, and in turn between the distinct nodes in a distributed operating system, is the functional challenge of the distributed operating system. This is the point in the system that must maintain a perfect harmony of purpose, and simultaneously maintain a complete disconnect of intent from implementation. This challenge is the distributed operating system's opportunity to produce the foundation and framework for a reliable, efficient, available, robust, extensible, and scalable system. However, this opportunity comes at a very high cost in complexity.
The price of complexity

In a distributed operating system, the exceptional degree of inherent complexity could easily render the entire system anathema to any user. As such, the logical price of realizing a distributed operating system must be calculated in terms of overcoming vast amounts of complexity in many areas, and on many levels. This calculation includes the depth, breadth, and range of design investment and architectural planning required in achieving even the most modest implementation.
These design and development considerations are critical and unforgiving. For instance, a deep understanding of a distributed operating system’s overall architectural and design detail is required at an exceptionally early point. An exhausting array of design considerations is inherent in the development of a distributed operating system. Each of these design considerations can potentially affect many of the others to a significant degree. This leads to a massive effort to balance the individual design considerations and many of their permutations. As an aid in this effort, most designers rely on documented experience and research in distributed components.
Node organization in different computing models
A distributed operating system’s hardware elements may be spread across multiple locations, whether within a single rack or around the world. Distributed configurations allow functions to be distributed as well as decentralized. The specific manner and relative degree of linkage between the elements, or nodes, of a system differentiates distributed from decentralized configurations. These linkages are the lines of communication between the nodes of the system.
Three basic distributions

To better illustrate this point, examine three system architectures: centralized, decentralized, and distributed. In this examination, consider three structural aspects: organization, connection, and control. Organization describes a system's physical arrangement characteristics. Connection covers the communication pathways among nodes. Control manages the operation of the earlier two considerations.

A centralized system has one level of structure, where all constituent elements directly depend upon a single control element. A decentralized system is hierarchical. The bottom level unites subsets of a system’s entities. These entity subsets in turn combine at higher levels, ultimately culminating at a central master element. A distributed system is a collection of autonomous elements with no concept of levels.

Centralized systems connect constituents directly to a central master entity in a hub-and-spoke fashion. A decentralized system (also known as a network system) incorporates direct and indirect paths between constituent elements and the central entity. Typically this is configured as a hierarchy with only one shortest path between any two elements. Finally, the distributed operating system requires no pattern; direct and indirect connections are possible between any two elements. Consider the 1970s phenomenon of “string art” or a spirograph drawing as a fully connected system, and the spider’s web or the Interstate Highway System between U.S. cities as examples of a partially connected system.
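The three connection patterns can be compared with a small sketch. It is illustrative only: the function names (`star`, `hierarchy`, `full_mesh`, `hops`) are hypothetical, the hierarchy is assumed to be a binary tree, and the full mesh stands in for the fully connected extreme of a distributed system, which in practice is often only partially connected.

```python
from collections import deque
from itertools import combinations


def star(n):
    """Centralized: node 0 is the hub; every other node links only to it."""
    return {(0, i) for i in range(1, n)}


def hierarchy(n):
    """Decentralized (assumed binary tree): the parent of node i is (i - 1) // 2."""
    return {((i - 1) // 2, i) for i in range(1, n)}


def full_mesh(n):
    """Fully connected: a direct link between every pair of nodes."""
    return set(combinations(range(n), 2))


def hops(links, src, dst):
    """Breadth-first search: fewest links to traverse from src to dst, or None."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == dst:
            return distance
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


n = 7
link_counts = {
    "centralized": len(star(n)),
    "decentralized": len(hierarchy(n)),
    "fully connected": len(full_mesh(n)),
}
```

With seven nodes, the star and the tree each need six links while the full mesh needs twenty-one; in exchange, any two mesh nodes are one hop apart, whereas two leaves of the tree may have to communicate through the root.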

Centralized and decentralized systems have directed flows of connection to and from the central entity, while distributed systems communicate along arbitrary paths. This is the pivotal notion of the third consideration. Control involves allocating tasks and data to system elements while balancing efficiency, responsiveness, and complexity.
Centralized and decentralized systems offer more control, potentially easing administration by limiting options. Distributed systems are more difficult to explicitly control, but scale better horizontally and offer fewer points of system failure. The associations conform to the needs imposed by the system's design but not by organizational limitations.
Design considerations


Transparency, or single-system image, refers to the ability of an application to treat the system on which it operates without regard to whether it is distributed and without regard to hardware or other implementation details. Many areas of a system can benefit from transparency, including access, location, performance, naming, and migration. The consideration of transparency directly affects decision making in every aspect of the design of a distributed operating system. Transparency can impose certain requirements and restrictions on other design considerations.

Location transparency—Location transparency comprises two distinct aspects of transparency, naming transparency and user mobility. Naming transparency requires that nothing in the physical or logical references to any system entity should expose any indication of the entity's location, or its local or remote relationship to the user or application. User mobility requires the consistent referencing of system entities, regardless of the system location from which the reference originates.
Access transparency—Local and remote system entities must remain indistinguishable when viewed through the user interface. The distributed operating system maintains this perception through the exposure of a single access mechanism for a system entity, regardless of that entity being local or remote to the user. Transparency dictates that any differences in methods of accessing any particular system entity—either local or remote—must be both invisible to, and undetectable by the user.
Migration transparency—Resources and activities migrate from one element to another controlled solely by the system and without user/application knowledge or action.
Replication transparency—The process or fact that a resource has been duplicated on another element occurs under system control and without user/application knowledge or intervention.
Concurrency transparency—Users/applications are unaware of and unaffected by the presence/activities of other users.
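As a minimal sketch of access and migration transparency, the following uses one call path for local and remote entities. All names here (`LocalEntity`, `RemoteEntity`, `Namespace`, the `reports/q3` name) are hypothetical, and the remote fetch is simulated with an in-process dictionary rather than a real network call.

```python
class LocalEntity:
    """An entity stored on the caller's own node."""

    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data


class RemoteEntity:
    """Stand-in for an entity on another node; `fetch` simulates the network call."""

    def __init__(self, fetch):
        self._fetch = fetch

    def read(self):
        return self._fetch()


class Namespace:
    """Maps location-free names to entities and exposes a single access mechanism."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, entity):
        self._bindings[name] = entity

    def read(self, name):
        # The caller cannot tell whether the entity is local or remote.
        return self._bindings[name].read()


remote_store = {"q3": b"figures"}  # pretend this dictionary lives on another node

ns = Namespace()
ns.bind("reports/q3", RemoteEntity(lambda: remote_store["q3"]))
first = ns.read("reports/q3")

# System-initiated migration: the data moves to the local node and the name is
# rebound, but the name and the call used by the application stay the same.
ns.bind("reports/q3", LocalEntity(remote_store.pop("q3")))
second = ns.read("reports/q3")
```

The name carries no hint of where the entity lives, and the application issues the same `read` before and after the move.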
Inter-process communication

Inter-Process Communication (IPC) is the implementation of general communication, process interaction, and dataflow between threads and/or processes both within a node, and between nodes in a distributed OS. The intra-node and inter-node communication requirements drive low-level IPC design, which is the typical approach to implementing communication functions that support transparency. In this sense, IPC is the greatest underlying concept in the low-level design considerations of a distributed operating system.
Process management

Process management provides policies and mechanisms for effective and efficient sharing of resources between distributed processes. These policies and mechanisms support operations involving the allocation and de-allocation of processes and ports to processors, as well as mechanisms to run, suspend, migrate, halt, or resume process execution. While these resources and operations can be either local or remote with respect to each other, the distributed OS maintains state and synchronization over all processes in the system.
As an example, load balancing is a common process management function. Load balancing monitors node performance and is responsible for shifting activity across nodes when the system is out of balance. One load balancing function is picking a process to move. The kernel may employ several selection mechanisms, including priority-based choice. This mechanism chooses a process based on a policy such as 'newest request'. The system implements the policy.
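The split between a selection mechanism and the policy it applies can be sketched as follows. This is an illustration under assumptions, not a description of any particular kernel: the `Process` fields, the two policy functions, and the `threshold` parameter are all hypothetical, and node load is approximated by the number of processes on a node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    pid: int
    priority: int  # larger means more important (assumed convention)
    arrival: int   # logical time at which the request arrived


# Policies are plain ranking functions; the mechanism picks the highest-ranked process.
def newest_request(process):
    return process.arrival


def lowest_priority(process):
    return -process.priority


def select_for_migration(processes, policy):
    """Mechanism: return the process the policy ranks highest, or None if empty."""
    return max(processes, key=policy, default=None)


def rebalance(loads, policy, threshold=2):
    """Move one process from the busiest node to the idlest when the gap is large enough.

    `loads` maps a node name to the list of processes running there.
    Returns (pid, source, target), or None when the system is considered balanced.
    """
    busiest = max(loads, key=lambda node: len(loads[node]))
    idlest = min(loads, key=lambda node: len(loads[node]))
    if len(loads[busiest]) - len(loads[idlest]) < threshold:
        return None
    chosen = select_for_migration(loads[busiest], policy)
    loads[busiest].remove(chosen)
    loads[idlest].append(chosen)
    return chosen.pid, busiest, idlest


loads = {
    "n1": [Process(1, 5, 10), Process(2, 1, 20), Process(3, 3, 30)],
    "n2": [],
}
first_move = rebalance(loads, newest_request)   # moves the most recent arrival
second_move = rebalance(loads, newest_request)  # gap is now below the threshold
```

Swapping `newest_request` for `lowest_priority` changes which process is moved without touching `select_for_migration` or `rebalance`.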
Resource management

System resources such as memory, files, and devices are distributed throughout a system, and at any given moment, any of its nodes may have light to idle workloads. Load sharing and load balancing require many policy-oriented decisions, ranging from finding idle CPUs to deciding when to move a process and which process to move. Many algorithms exist to aid in these decisions; however, this calls for a second level of decision-making policy in choosing the algorithm best suited for the scenario and the conditions surrounding it.

A distributed OS can provide the necessary resources and services to achieve high levels of reliability, or the ability to prevent and/or recover from errors. Faults are physical or logical defects that can cause errors in the system. For a system to be reliable, it must somehow overcome the adverse effects of faults.
The primary methods for dealing with faults include fault avoidance, fault tolerance, and fault detection and recovery. Fault avoidance covers proactive measures taken to minimize the occurrence of faults. These proactive measures can be in the form of transactions, replication, and backups. Fault tolerance is the ability of a system to continue operation in the presence of a fault. In the event of a fault, the system should detect it and recover full functionality. In any event, any actions taken should make every effort to preserve the single-system image.

Availability is the fraction of time during which the system can respond to requests.
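The definition translates directly into a small calculation. The sketch below uses hypothetical function names; the second function additionally assumes that replicas fail independently, which is an optimistic simplification and not a claim made by the text above.

```python
def availability(uptime_seconds, downtime_seconds):
    """Fraction of the observed period during which the system could respond."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_seconds / total


def replicated_availability(single_node_availability, replicas):
    """Availability when any one of `replicas` copies can serve a request.

    Assumes independent failures: the service is down only if every replica is down.
    """
    return 1 - (1 - single_node_availability) ** replicas


# One 365-day year with 8.76 hours (31,536 seconds) of downtime.
year_seconds = 365 * 24 * 3600
yearly = availability(year_seconds - 31_536, 31_536)
two_replicas = replicated_availability(0.9, 2)
```

Under these assumptions, 8.76 hours of downtime per year corresponds to an availability of about 0.999, and two independent replicas that are each available 90% of the time yield about 0.99.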

Many benchmark metrics quantify performance: throughput, response time, job completions per unit time, system utilization, etc. With respect to a distributed OS, performance most often distills to a balance between process parallelism and IPC. Managing the task granularity of parallelism in a sensible relation to the messages required for support is extremely effective. Identifying when it is more beneficial to migrate a process to its data, rather than copy the data, is effective as well.
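The choice between moving a process and copying its data can be illustrated with a deliberately simple cost model. Everything in this sketch is an assumption made for illustration: the function names, the default bandwidth and latency values, and the idea that transfer time alone decides the matter (a real system would also weigh factors such as CPU load on the target node).

```python
def transfer_seconds(size_bytes, bandwidth_bytes_per_s, latency_s):
    """Estimated time to ship `size_bytes` over one link: latency plus transmission."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def should_migrate_process(process_state_bytes, data_bytes,
                           bandwidth_bytes_per_s=125_000_000, latency_s=0.0005):
    """Return True when shipping the process state is cheaper than copying the data.

    The defaults model a hypothetical 1 Gbit/s link with 0.5 ms latency.
    """
    move_process = transfer_seconds(process_state_bytes, bandwidth_bytes_per_s, latency_s)
    copy_data = transfer_seconds(data_bytes, bandwidth_bytes_per_s, latency_s)
    return move_process < copy_data


# A 2 MB process working on a 4 GB data set: move the process.
large_data = should_migrate_process(2_000_000, 4_000_000_000)
# The same process needing only 64 KB of remote data: copy the data instead.
small_data = should_migrate_process(2_000_000, 64_000)
```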

Cooperating concurrent processes have an inherent need for synchronization, which ensures that changes happen in a correct and predictable fashion. Three basic situations define the scope of this need:
• one or more processes must synchronize at a given point for one or more other processes to continue,
• one or more processes must wait for an asynchronous condition in order to continue,
• or a process must establish exclusive access to a shared resource.
Improper synchronization can lead to multiple failure modes, including loss of atomicity, consistency, isolation, and durability; deadlock; livelock; and loss of serializability.
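The three situations listed above can be sketched with the primitives in Python's standard `threading` module. This is a single-node illustration using threads; a distributed OS would have to provide equivalent behaviour across nodes through message passing. The function `run_demo` and its parameters are hypothetical.

```python
import threading


def run_demo(workers=4, increments=1000):
    start_signal = threading.Event()         # wait for an asynchronous condition
    rendezvous = threading.Barrier(workers)  # synchronize at a given point
    lock = threading.Lock()                  # exclusive access to shared state
    state = {"arrived": 0, "counter": 0, "seen_after_barrier": []}

    def worker():
        # Block until the main thread signals that work may begin.
        start_signal.wait()
        with lock:
            state["arrived"] += 1
        # No worker passes this point until all workers have reached it.
        rendezvous.wait()
        with lock:
            state["seen_after_barrier"].append(state["arrived"])
        # Each update of the shared counter is protected by the lock.
        for _ in range(increments):
            with lock:
                state["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    start_signal.set()
    for thread in threads:
        thread.join()
    return state["counter"], state["seen_after_barrier"]


counter, seen = run_demo()
```

Because every worker increments `arrived` before reaching the barrier, each one observes the full worker count afterwards, and the lock ensures that no counter increment is lost.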

Flexibility in a distributed operating system is enhanced through the modular characteristics of the distributed OS, and by providing a richer set of higher-level services. The completeness and quality of the kernel/microkernel simplifies implementation of such services, and potentially gives service providers a greater choice of providers for such services.
Let us assume that the model of a distributed operating system is constructed for a heterogeneous local computer network with N nodes, where a heterogeneous network is a network that connects different computers and different peripherals, and in which there are different types of admissible operations on resources of an operating system. No restrictions have been imposed on the topology of the network. It has been assumed that the topology does not impose restrictions on individual nodes, i.e., all nodes are equally privileged and can carry out any function of the operating system.
Based on the definitions of the object model of a centralized operating system, the construction problems of a distributed operating system are:
- access to remote resources,
- management of network resources,
- process synchronization,
- protection and reliability of an operating system.
These problems can be stated as follows: the new logical resources that should be defined to develop an effective distributed operating system are not known, and an effective structure of the distributed operating system, i.e., the connections between the processes managing resources and the distribution of processes in the network, is not known. The problems given above are complicated by the fact that a set of admissible operations on a resource could be implemented as a set of connected concurrent processes located in different nodes of a network.
Within the complex problem of defining an effective distributed operating system, it is possible to extract a much more basic problem. Some new resources of the distributed operating system are known, and defining managing processes for them makes it possible to develop a model of a distributed operating system that could be treated as a basis for further research. One such new type of logical resource introduced by a computer network is the message used in communication and interprocess synchronization [Moo 82, Tan 85]. Communication could be carried out between operating system processes, between user processes, and between an operating system process and a user process. Message passing requires managing additional physical resources that are not known in centralized operating systems, i.e., the communication interface, and possibly creating additional logical resources to perform message passing effectively. Messages are sent in the network between two logically addressed units (e.g., processes, or ports connected to processes). The reliability requirements and the need for a dynamically balanced network load can imply that the addresses of communicating processes (the addresses of the network nodes where these processes run) are not constant. So, management of message passing requires system information about the present locations of message receivers.
That need generates the second new type of logical resource of the distributed operating system: data structures describing the location of resources in the network. These data have to describe the location of all logical and physical resources (processes managing resources) known by user processes and/or processes of the operating system. Example solutions to the problem of distributing these data structures include: a centralized description of the location of resources in the network (known in one network node); a distributed description according to classes of resources; local resources known in each network node; and the location of all resources known in each node. Models of different methods of resource addressing will be presented in Section 4.
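The simplest of these options, a centralized description held in one place, can be sketched as follows. The classes `AddressingProcess` and `Network`, their methods, and the `printer-manager` address are hypothetical names chosen for illustration; they are not the models presented in Section 4, and node mailboxes are simulated with in-memory lists.

```python
class AddressingProcess:
    """Hypothetical addressing process: one table from logical address to current node."""

    def __init__(self):
        self._location = {}

    def register(self, logical_address, node):
        self._location[logical_address] = node

    def relocate(self, logical_address, new_node):
        # Called by the system when a receiver moves, e.g. for load balancing.
        if logical_address not in self._location:
            raise KeyError(logical_address)
        self._location[logical_address] = new_node

    def resolve(self, logical_address):
        return self._location[logical_address]


class Network:
    """Delivers messages to whichever node currently hosts the logical receiver."""

    def __init__(self, node_names, addressing):
        self.mailboxes = {name: [] for name in node_names}
        self.addressing = addressing

    def send(self, receiver, message):
        # The sender names only the logical receiver; the location is looked up
        # at send time, so it may differ from one message to the next.
        node = self.addressing.resolve(receiver)
        self.mailboxes[node].append((receiver, message))
        return node


addressing = AddressingProcess()
network = Network(["n1", "n2"], addressing)
addressing.register("printer-manager", "n1")

first_hop = network.send("printer-manager", "job 1")
addressing.relocate("printer-manager", "n2")
second_hop = network.send("printer-manager", "job 2")
```

The sender's code is identical for both messages even though the receiver has moved. The distributed variants listed above would replace the single table with per-class or per-node tables, at the cost of keeping those copies consistent.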
The addition of the new types of resources discussed above to the operating system requires:
(i) the definition of the logical representation of the resources, i.e., the definition of the data structures describing the location of all resources and the distribution of these data structures in the network,
(ii) the definition of operations on resources and synchronization of operations,
(iii) the implementation of these operations by managing processes which we call addressing processes, and
(iv) the definition of methods of attaching addressing processes to the system of connection processes managing the message passing.
The method of defining these new resources influences the effectiveness of the distributed operating system (measured by given performance indices such as reaction time to an event, service time of an event, etc.). Searching for the definition of the addressing processes is part of the much more general problem of constructing an effective distributed operating system, in particular of choosing suitable logical resources necessary to allocate physical resources among competing processes in such a way that they can be used effectively.