Reverse Engineering of Object Oriented Code. Springer

Monographs in Computer Science

Paolo Tonella

Alessandra Potrich



Foreword

There has been an ongoing debate on how best to document a software system
ever since the first software system was built. Some would have us writing natural
language descriptions, some would have us prepare formal specifications,
others would have us producing design documents and others would want us
to describe the software through test cases. There are even those who would have
us do all four, writing natural language documents, writing formal specifications,
producing standard design documents and producing interpretable test
cases all in addition to developing and maintaining the code. The problem
with this is that whatever is produced in the way of documentation quickly becomes
useless unless it is maintained in parallel with the code. Maintaining
alternate views of complex systems becomes very expensive and highly
error prone. The views tend to drift apart and become inconsistent.
The authors of this book provide a simple solution to this perennial problem.
Only the source code is maintained and evolved. All of the other information
required on the system is taken from the source code. This entails
generating a complete set of UML diagrams from the source. In this way, the
design documentation will always reflect the real system as it is and not the
way the system should be from the viewpoint of the documentor. There can
be no inconsistency between design and implementation. The method used is
that of reverse engineering; the target of the method is object oriented code in
C++, C#, or Java. From the code, class diagrams, object diagrams, interaction
diagrams and state diagrams are generated in accordance with the latest
UML standard. Since the method is automated, there are no additional costs.
Design documentation is provided at the click of a button.
This approach, the result of many years of research and development, will
have a profound impact upon the way IT-systems are documented. Besides
the source code itself, only one other view of the system needs to be developed
and maintained, that is, the user view in the form of a domain specific language.
Each application domain will have to come up with its own language
to describe applications from the viewpoint of the user. These languages may
range from natural languages to set theory to formal mathematical notations.

What these languages will not describe is how the system is or should be constructed.
This is the purpose of UML as a modeling language. The techniques
described in this book demonstrate that this design documentation can and
should be extracted from the code, since this is the cheapest and most reliable
means of achieving this end. There may be some UML documents produced
on the way to the code, but since complex IT systems are almost always developed
by trial and error, these documents will only have a transitory nature.
The moment the code exists they are both obsolete and superfluous. From
then on, the same documents can be produced cheaper and better from the
code itself. This approach coincides with and supports the practice of extreme programming.
Of course there are several drawbacks, as some types of information are
not captured in the code and, therefore, reverse engineering cannot capture
them. An example is that there still needs to be a test oracle – something to
test against. This something is the domain specific specification from which
the application-oriented test cases are derived. The technical test cases can
be derived from the generated UML diagrams. In this way, the system as
implemented will be verified against the system as specified. Without the
UML diagrams, extracted from the code, there would be no adequate basis of comparison.
For these and other reasons, this book is highly recommended to all
who are developing and maintaining Object-Oriented software systems. They
should be aware of the possibilities and limitations of automated post documentation.
It will become increasingly significant in the years to come, as the
current generation of OO-systems becomes the legacy systems of the future.
The implementation knowledge they encompass will most likely be only in the
source, and there will be no means of regaining it other than through reverse engineering.
Trento, Italy, July 2004
Benevento, Italy, July 2004
Harry Sneed
Aniello Cimitile


Preface

Diagrams representing the organization and behavior of an Object Oriented
software system can help developers comprehend it and evaluate the impact of
a modification. However, such diagrams are often unavailable or inconsistent
with the code. Their extraction from the code is thus an appealing option.
This book represents the state of the art of the research in Object Oriented
code analysis for reverse engineering. It describes the algorithms involved
in the recovery of several alternative views from the code and some of the
techniques that can be adopted for their visualization.
During software evolution, the availability of high level descriptions is extremely
desirable, in support of program understanding and change-impact
analysis. In fact, locating a change to be implemented can be guided by
high level views. The dependences among entities in such views indicate the
extent of the ripple effects.
However, it is often the case that diagrams available during software evolution
are not consistent with the code, or, even more frequently, that no
diagram has been produced at all. In such contexts, it is crucial to be
able to reverse engineer design diagrams directly from the code. Reverse engineered
diagrams are a faithful representation of the actual code organization
and of the actual interactions among objects. Programmers do not face any
misalignment or gap when moving from such diagrams to the code.
The material presented in this book is based on the techniques developed
during a collaboration we had with CERN (Conseil Européen pour la
Recherche Nucléaire). At CERN, work for the next generation of experiments
to be run on the Large Hadron Collider started well in advance, since
these experiments represent a major challenge, given the size of the devices,
teams, and software involved. We collaborated with CERN on the introduction
of tools for software quality assurance, among them a reverse engineering tool.
The algorithms described in this book deal with the reverse engineering of
the following diagrams:
Class diagram: Extraction of inter-class relationships in the presence of weakly
typed containers and interfaces, which prevent exact knowledge of the
actual type of referenced objects.
Object and interaction diagrams: Recovery of the associations among
the objects that instantiate the classes in a system and of the messages
exchanged among them.
State diagram: Modeling of the behavior of each class in terms of states and
state transitions.
Package diagram: Identification of packages and of the dependences among packages.
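
The difficulty posed by weakly typed containers can be illustrated with a small
Java sketch. The classes below (Library, Loan, Document) are hypothetical and
are not taken from the eLib source in Appendix A. They only show why declared
types alone may hide an inter-class relationship: the container is declared
with element type Object, although only Loan objects are ever stored in it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration only; not the eLib source.
class Document {
    final String title;
    Document(String title) { this.title = title; }
}

class Loan {
    final Document document;
    Loan(Document document) { this.document = document; }
}

class Library {
    // Raw (weakly typed) container: the declaration mentions only List,
    // so it reveals no association between Library and Loan.
    @SuppressWarnings("rawtypes")
    private final List loans = new ArrayList();

    @SuppressWarnings("unchecked")
    void addLoan(Loan loan) {
        loans.add(loan);              // only Loan objects flow into the container
    }

    Loan firstLoan() {
        return (Loan) loans.get(0);   // the cast reveals the actual element type
    }

    int loanCount() {
        return loans.size();
    }
}
```

A diagram built from declared types alone would miss the association from
Library to Loan here; recovering it requires knowing which objects actually
flow into the container.
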
All the algorithms share a common code analysis framework. The basic
principle underlying such a framework is that information is derived statically
(no code execution) by performing a propagation of proper data in a graph
representation of the object flows occurring in a program. The data structure
that has been defined for such a purpose is called the Object Flow Graph
(OFG). It allows tracking the lifetime of the objects from their creation along
their assignment to program variables.
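
As a rough illustration of propagating data over a graph of object flows, the
following Java sketch may help. It is not the OFG construction or the
propagation algorithm defined later in the book; it is a generic worklist
fixpoint over a hand-built graph, and the class name FlowGraphSketch, its
methods, and the node names used with it are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny flow graph with set-union propagation to a fixpoint.
final class FlowGraphSketch {
    private final Map<String, Set<String>> successors = new HashMap<>();
    private final Map<String, Set<String>> out = new HashMap<>();

    // An assignment "target = source" is modeled as the edge source -> target.
    void addFlow(String source, String target) {
        successors.computeIfAbsent(source, k -> new HashSet<>()).add(target);
        successors.computeIfAbsent(target, k -> new HashSet<>());
    }

    // An allocation "location = new Type()" seeds the location with that type.
    void addAllocation(String location, String type) {
        successors.computeIfAbsent(location, k -> new HashSet<>());
        out.computeIfAbsent(location, k -> new HashSet<>()).add(type);
    }

    // Worklist iteration: push each node's set along its outgoing edges
    // until no set changes any more. No code of the analyzed program is run.
    Map<String, Set<String>> propagate() {
        Deque<String> worklist = new ArrayDeque<>(successors.keySet());
        while (!worklist.isEmpty()) {
            String node = worklist.poll();
            Set<String> data = out.computeIfAbsent(node, k -> new HashSet<>());
            for (String next : successors.get(node)) {
                Set<String> nextData = out.computeIfAbsent(next, k -> new HashSet<>());
                if (nextData.addAll(data)) {
                    worklist.add(next);   // the successor changed, so revisit it
                }
            }
        }
        return out;
    }
}
```

For instance, seeding a location with the type Book and adding flows from that
location to a method parameter and on to a container field makes Book appear in
the set computed for the field.
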
UML, the Unified Modeling Language, has been chosen as the graphical
language to present the outcome of reverse engineering. This choice was motivated
by the fact that UML has become the standard for the representation
of design diagrams in Object Oriented development. However, the choice of
UML is by no means restrictive, in that the same information recovered from
the code can be provided to the users in different graphical or non graphical formats.
A well known concern of most reverse engineering methods is how to filter
the results, when their size and complexity are excessively high. Since
the recovered diagrams are intended to be inspected by a human, the presentation
modes should take into account the cognitive limitations of humans
explicitly. Techniques such as focusing, hierarchical structuring and element
explosion/implosion will be introduced specifically for some diagram types.
The research community working in the field of reverse engineering has
produced an impressive amount of knowledge related to techniques and tools
that can be used during software evolution in support of program understanding.
It is the authors’ opinion that an important step forward would be
to publish the achievements obtained so far in comprehensive books dealing
with specific subtopics.
This book on reverse engineering from Object Oriented code goes exactly
in this direction. The authors have produced several research papers in this
field over time and have been active in the research community. The techniques
and the algorithms described in the book represent the current state of the art.
Trento, Italy
July 2004
Paolo Tonella
Alessandra Potrich

Introduction

Reverse engineering aims at supporting program comprehension, by exploiting
the source code as the major source of information about the organization
and behavior of a program, and by extracting a set of potentially useful views
provided to programmers in the form of diagrams. Alternative perspectives
can be adopted when the source code is analyzed and different higher level
views are extracted from it. The focus may either be on the structure, on
the behavior, on the internal states, or on the physical organization of the
files. A single diagram recovered from the code through reverse engineering
is insufficient. Rather, a set of complementary views needs to be obtained,
addressing different program understanding needs.

In this chapter, the role of reverse engineering within the life cycle of a
software system is described. The activities of program understanding and
impact analysis are central during the evolution of an existing system. Both
activities can benefit from sources of knowledge about the program such as
reverse engineered diagrams.

The reverse engineering techniques presented in the following chapters are
described with reference to an example program used throughout the book. In
this chapter, this example program is introduced and discussed. Then, some
of the diagrams that are the object of the following chapters are provided for
the example program, showing their usefulness from the programmer’s point
of view. The remaining parts of the book contain the algorithmic details on
how to recover them from the source code.

Reverse Engineering
In the life cycle of a software system, the maintenance phase is the largest
and the most expensive. Starting after the delivery of the first version of the
software [35], maintenance lasts much longer than the initial development
phase. During this time, the software will be changed and enhanced over and
over. So it is more appropriate to speak of software evolution with reference
to the whole life cycle, in which the initial development is only a special case
where the existing system is empty.

Software evolution is characterized by the existence of the source code of
the system. Thus, the typical activity in software evolution is the implementation
of a program change, in response to a change request. Changes may
be aimed at correcting the software (corrective maintenance), at adding
functionality (perfective maintenance), at adapting the software to a changed
environment (adaptive maintenance), or at restructuring it to make future
maintenance easier (preventive maintenance) [35].

During software evolution, the most reliable and accurate description of
the behavior of a software system is its source code. In fact, design diagrams
are often outdated or missing altogether. However, even such a valuable information
repository may not directly answer all questions about the system. Reverse engineering
techniques provide a way to extract higher level views of the system,
which summarize some relevant aspects of the computation performed by the
program statements. Reverse engineered diagrams support program comprehension,
as well as restructuring and traceability.

When an existing code base is worked on, the micro-process of program
change can be decomposed into localizing the change, assessing the impact,
and implementing the change. All such activities depend on the knowledge
available about the program to be modified. In this respect, reverse engineering
techniques are a useful support. Reverse engineering tools provide useful
high level information about the system being maintained, thus helping programmers
locate the component to be modified. Moreover, the relationships
(dependencies, associations, etc.) that connect the entities in reverse engineered
diagrams provide indications about the impact of a change. By tracing
such relationships, the set of entities possibly affected by a change is obtained.
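
Tracing relationships to estimate the impact of a change can be sketched as a
reachability computation over a dependency graph. The Java code below is a
minimal illustration under that assumption, not a technique taken from the
book. The class ImpactSketch and the entity names used with it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: entities possibly affected by a change are those that
// depend, directly or indirectly, on the changed entity.
final class ImpactSketch {
    // dependents.get(x) holds the entities that depend on x.
    private final Map<String, Set<String>> dependents = new HashMap<>();

    // Records that "from" depends on "to".
    void addDependency(String from, String to) {
        dependents.computeIfAbsent(to, k -> new HashSet<>()).add(from);
    }

    // Breadth-first traversal of the "depends on" relationships, backwards.
    Set<String> impactOf(String changed) {
        Set<String> affected = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(changed);
        while (!pending.isEmpty()) {
            String current = pending.poll();
            Set<String> direct = dependents.getOrDefault(current, Collections.emptySet());
            for (String d : direct) {
                if (!d.equals(changed) && affected.add(d)) {
                    pending.add(d);   // newly found entity: follow its dependents too
                }
            }
        }
        return affected;
    }
}
```

With dependencies Library on Loan, Loan on Document, and Main on Library, a
change to Document yields Library, Loan, and Main as the possibly affected
entities, while unrelated entities are left out.
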

Object Oriented programming poses special problems to software engineers
during the maintenance phase. Correspondingly, reverse engineering
techniques have to be customized to address them. For example, the behavior
of an Object Oriented program emerges from the interactions occurring among
the objects allocated in the program. The related instructions may be spread
across several classes, which individually perform a very limited portion of
the work locally and delegate the rest of it to others. Reverse engineered diagrams
capture such collaborations among classes/objects, summarizing them
in a single, compact view. However, recovering accurate information about
such collaborations represents a special challenge, requiring major improvements
to the available reverse engineering methods [48, 100].

When a software system is analyzed to extract information about it, the
fundamental choice is between static and dynamic analysis. Dynamic analysis
requires a tracer tool to save information about the objects manipulated and
the methods dispatched during program execution. The diagrams that can
be reverse engineered in this way are partial. They are valid only for a single,
given execution of the program, with given input values, and they cannot be
easily generalized to the behavior of the program for any execution with any
input. Moreover, dynamic analysis is possible only for complete, executable
systems, while in Object Oriented programming it is typical to produce incomplete
sets of classes that are reused in different contexts. On the contrary,
a static analysis produces results that are valid for all executions and for all
inputs. On the other hand, static analyses may be over-conservative. In fact,
it is undecidable whether a statically possible path is feasible, i.e., whether
there exists an input value allowing its traversal. Static analysis may conservatively
assume that some paths are executable when they actually are not.
Consequently, it may produce results for which no input value exists. In the
following chapters, the advantages and disadvantages of the two approaches
will be discussed for each specific diagram, illustrating them on an executable example.
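
The notion of a statically possible but infeasible path can be made concrete
with a short Java sketch. The classes Shape, Circle, and Square and the method
choose are hypothetical and serve only as an illustration.

```java
// Hypothetical classes for illustration only.
abstract class Shape {
    abstract String name();
}

class Circle extends Shape {
    String name() { return "circle"; }
}

class Square extends Shape {
    String name() { return "square"; }
}

final class InfeasiblePathSketch {
    static Shape choose(int n) {
        Shape s = new Circle();
        if (n % 2 == 2) {        // never true: in Java, n % 2 is always -1, 0, or 1
            s = new Square();    // statically reachable, but no input executes it
        }
        return s;
    }
}
```

A static analysis that does not reason about the condition would report that
choose may return either a Circle or a Square, whereas any execution traced
dynamically returns a Circle. Conversely, a single trace of choose says nothing
about inputs that were not tried.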

UML (Unified Modeling Language) [7, 69] has become the standard graphical
language used to represent Object Oriented systems in diagrammatic form.
Its specifications have been recently standardized by the Object Management
Group (OMG) [1]. UML has been adopted by several software companies, and
its theoretical aspects are the subject of several research studies. For these reasons,
UML was chosen as the graphical representation that is produced as the
output of the reverse engineering techniques described in this book. However,
the choice of UML is by no means limiting: while the information reverse
engineered from the code can be represented in different graphical (or non
graphical) forms, the basic analysis methods exploited to produce it can be
reused unchanged in alternative settings, with UML replaced by some other
description language.

An important issue reverse engineering techniques must take into account
is usability. Since the recovered views are for humans and not for computers,
they must be compatible with the cognitive abilities of human beings. This
means that diagrams convey useful information only if their size is kept small
(while 10 entities may be fine, 100 is already too many and 1000 makes a
diagram unreadable). Several approaches can be adopted to support visualization
and navigation modes making reverse engineered information usable.
They range from the possibility to focus on a portion of the system, to the
expand/collapse or zoom in/out operations, or to the availability of an overall
navigation map complemented by a detailed view. In the following chapters,
ad hoc methods will be described with reference to the specific diagrams being produced.

The eLib Program
The eLib program is a small Java program that supports the main functions
performed in a library. Its code is provided in Appendix A. It will be used as
the running example in the remainder of this book.

In eLib, libraries are supposed to hold an archive of documents of different
categories, properly classified. Each document can be uniquely identified by
the librarian. Library users can request some of these documents for loan,
subject to proper access rules. In order to borrow a document, users must be
identified by the librarian. For example, this could be achieved by distributing
library cards to registered users.

As regards the management of the documents in the eLib system, the
librarian can insert new documents in the archive and remove documents
no longer available in the library. Upon request, the librarian may need to
search the archive for documents according to some search criterion, such as
title, authors, ISBN code, etc. The documents held by a library are of several
different kinds, including books, journals, and technical reports. Each of them
has specific properties and specific access restrictions.

As far as user management is concerned, a set of personal data (name,
address, phone number, etc.) is maintained in the archive. A special category
of users consists of internal users, who have special permission to access
documents not allowed for loan to normal users.

The main functionality of the eLib system is loan management. Users can
borrow documents up to a maximum number. While books are available for
loan to any user, journals can be borrowed only by internal users, and technical
reports can be consulted but not borrowed.
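
The loan rules just described can be summarized in a few lines of Java. This is
a hypothetical sketch and not the code from Appendix A: the names DocumentKind,
LoanRulesSketch, and canBorrow are invented here, and the value of MAX_LOANS is
an assumption, since the text states only that a maximum exists.

```java
// Hypothetical sketch of the loan rules; not the eLib source in Appendix A.
enum DocumentKind { BOOK, JOURNAL, TECHNICAL_REPORT }

final class LoanRulesSketch {
    // Assumed limit: the text says a maximum exists but gives no number.
    static final int MAX_LOANS = 5;

    static boolean canBorrow(DocumentKind kind, boolean internalUser, int currentLoans) {
        if (currentLoans >= MAX_LOANS) {
            return false;                 // user already holds the maximum number of loans
        }
        switch (kind) {
            case BOOK:
                return true;              // books are available to any user
            case JOURNAL:
                return internalUser;      // journals only for internal users
            case TECHNICAL_REPORT:
                return false;             // reports can be consulted, not borrowed
            default:
                return false;
        }
    }
}
```

In a single method the rules are easy to read. In an Object Oriented program
they may instead be spread over several collaborating classes, which is what
makes the questions below harder to answer from the source alone.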

Although this is a small application, by going through the source code
of the eLib program (see Appendix A) it is not so easy to understand how
the classes are organized, how they interact with each other to fulfill the
main functions, how responsibilities are distributed among the classes, what
is computed locally and what is delegated. For example, a programmer aiming
at understanding this application may have the following questions:
What is the overall system organization?
What objects are updated when a document is borrowed?
What classes are responsible for checking whether a given document can be borrowed
by a given user?
How is the maximum number of loans handled?
What happens to the state of the library when a document is returned?
Let us assume the following change request (perfective maintenance):
When a document is not available for loan, a user can reserve it, if it
has not been previously reserved by another user. When a document
is returned to the library, the user who reserved it is contacted, if
any is associated with the document. The user can either borrow the
document that has become available or cancel the reservation. In both
cases, after this operation the reservation of the document is deleted.
The programmer who is responsible for its implementation may have the following
questions about the system:
Does the overall system organization need any change?
What classes need to collaborate to realize the reservation functionality?
Is there any possible side effect on the existing functionalities?
What changes should be made in the procedure for returning documents
to the library?
How is the new state of a document described?
Is there any interaction between the new rules for document borrowing
and the existing ones?
In the following sections, we will see how UML diagrams reverse engineered
from the code can help answer the program understanding and impact analysis
questions listed above.

Table of Contents
Foreword
Preface

1. Introduction
Reverse Engineering
The eLib Program
Class Diagram
Object Diagram
Interaction Diagrams
State Diagrams
Organization of the Book

2. The Object Flow Graph
Abstract Language
Object Flow Graph
Flow Propagation Algorithm
Object sensitivity
The eLib Program
Related Work

3. Class Diagram
Class Diagram Recovery
Recovery of the inter-class relationships
Declared vs. actual types
Flow propagation
Flow propagation
The eLib Program
Related Work
Object identification in procedural code

4. Object Diagram
The Object Diagram
Object Diagram Recovery
Object Sensitivity
Dynamic Analysis
The eLib Program
OFG Construction
Object Diagram Recovery
Dynamic analysis
Related Work

5. Interaction Diagrams
Interaction Diagrams
Interaction Diagram Recovery
Incomplete Systems
Dynamic Analysis
The eLib Program
Related Work

6. State Diagrams
State Diagrams
Abstract Interpretation
State Diagram Recovery
The eLib Program
Related Work

7. Package Diagram
Package Diagram Recovery
Modularity Optimization
Feature Vectors
Concept Analysis
The eLib Program
Related Work

8. Conclusions
Tool Architecture
Language Model
The eLib Program
Change Location
Impact of the Change
Related Work
Code Analysis at CERN

A Source Code of the eLib program
B Driver class for the eLib program


© 2005 Springer Science+Business Media, Inc.