2023
Incrementalizing Production CodeQL Analyses
Tamás Szabó
In Foundations of Software Engineering (FSE), Industry Track. ACM, 2023.
[ pdf ]
Instead of repeatedly re-analyzing from scratch, an incremental
static analysis only analyzes a codebase once completely, and then
it updates the previous results based on the code changes. While
this sounds like a promising way to achieve speed-ups, the reality is that
sophisticated static analyses typically employ features that can ruin
incremental performance, such as inter-procedurality or context-sensitivity.
In this study, we set out to explore whether incrementalization can help to
achieve speed-ups for production CodeQL analyses that provide automated feedback on pull requests on GitHub.
We first empirically validate the idea by measuring the potential
for reuse on real-world codebases, and then we create a prototype
incremental solver for CodeQL that exploits incrementality. We
report on experimental results showing that we can indeed achieve
update times proportional to the size of the code change, and we
also discuss the limitations of our prototype.
2021
Concise, Type-Safe, and Efficient Structural Diffing
Sebastian Erdweg, Tamás Szabó, and André Pacak
In Programming Language Design and Implementation (PLDI). ACM, 2021.
[ pdf ]
A structural diffing algorithm compares two pieces of tree-shaped data and computes their difference.
Existing structural diffing algorithms either produce concise patches or ensure type safety, but never both.
We present a new structural diffing algorithm called truediff that achieves both properties by treating subtrees
as mutable, yet linearly typed resources. Mutation is required to derive concise patches that only mention
changed nodes, but, in contrast to prior work, truediff guarantees all intermediate trees are well-typed.
We formalize type safety, prove truediff has linear run time, and evaluate its performance and the conciseness
of the derived patches empirically for real-world Python documents. While truediff ensures type safety,
the size of its patches is on par with Gumtree, a popular untyped diffing implementation. Regardless,
truediff outperforms Gumtree and a typed diffing implementation by an order of magnitude.
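The linear-typing machinery is beyond the scope of an abstract, but the underlying intuition of emitting patch entries only for changed nodes while reusing identical subtrees can be sketched as follows. This is a hypothetical, much-simplified illustration, not the truediff algorithm itself; `Node`, `diff`, and the patch format are invented for this sketch:

```python
# Much-simplified sketch of subtree-reusing structural diffing
# (illustrative only; not the actual truediff algorithm).

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()

def diff(old, new, path=()):
    """Emit patch operations only for changed nodes; identical
    subtrees are detected (here via structural equality) and reused."""
    if old == new:
        return []                        # whole subtree reused, no patch entry
    if old.label != new.label or len(old.children) != len(new.children):
        return [("replace", path, new)]  # replace this subtree wholesale
    ops = []
    for i, (o, n) in enumerate(zip(old.children, new.children)):
        ops += diff(o, n, path + (i,))
    return ops

old = Node("add", (Node("x"), Node("y")))
new = Node("add", (Node("x"), Node("z")))
print(diff(old, new))  # only the changed leaf appears in the patch
```

A real implementation must additionally detect moved subtrees and keep every intermediate tree well-typed, which is exactly where the linearly typed treatment of subtrees comes in.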
Incremental Whole-Program Analysis in Datalog with Lattices
Tamás Szabó, Sebastian Erdweg, and Gábor Bergmann
In Programming Language Design and Implementation (PLDI). ACM, 2021.
[ pdf ]
Incremental static analyses provide up-to-date analysis results in time proportional to the size of
a code change, not the entire code base. This promises fast feedback to
programmers in IDEs and when checking in commits. However, existing incremental analysis frameworks
fail to deliver on this promise for whole-program lattice-based data-flow analyses. In particular, prior
Datalog-based frameworks yield good incremental performance only for intra-procedural analyses.
In this paper, we first present a methodology to empirically
test if a computation is amenable to incrementalization. Using this methodology, we find that
incremental whole-program analysis may be possible. Second, we present a new incremental Datalog
solver called LADDDER to eliminate the shortcomings of prior Datalog-based analysis frameworks.
Our Datalog solver uses a non-standard aggregation semantics which allows us to loosen monotonicity
requirements on analyses and to improve the performance of lattice aggregators considerably.
Our evaluation on real-world Java code confirms that LADDDER provides up-to-date
points-to, constant propagation, and interval information in milliseconds.
Incrementalizing Static Analyses in Datalog
Tamás Szabó
Doctoral dissertation, JGU Mainz, 2021.
[ pdf ]
Static analyses are tools that reason about the behavior of computer programs without actually executing them.
They play an important role in many areas of software development because they can catch potential runtime errors
or reason about the security or performance of subject programs already at development time. For example,
Integrated Development Environments (IDEs) use a variety of static analyses to provide continuous feedback to
developers as they edit their programs. IDEs run type checkers to verify that subject programs are free of
type inconsistencies, or they run data-flow analyses to find security vulnerabilities. In turn, developers
can fix or improve their programs before those go into production, thereby avoiding the significant costs
that failures would otherwise incur.
Designing static analyses that provide continuous feedback in IDEs is challenging because analyses must balance
a vexing trade-off between precision and performance. On the one hand, an analysis must precisely capture the
program behavior by considering all possible concrete executions. On the other hand, an analysis running in an
IDE must update its results after a program change in a fraction of a second, otherwise it interrupts the
development flow, which is counterproductive. However, these two requirements are in conflict with each other,
as more precision comes with more computational cost. Given that we cannot loosen the timing constraint in an IDE,
the challenge is how to speed up static analyses while sacrificing as little precision as possible.
We study how to use incrementality to speed up static analyses. Instead of repeatedly reanalyzing the entire
subject program from scratch, an incremental analysis reuses previous results and updates them based on the
changed code parts. An incremental analysis can achieve significant performance improvements over its
non-incremental counterpart if the computational cost of updating analysis results is proportional to the
size of the change and not the size of the entire program. However, precise static analyses typically use
features that significantly increase computational costs by requiring the re-analysis of more than just the
changed code parts. Moreover, incrementalization requires sophisticated algorithms and data structures,
so specialized solutions often do not pay off in terms of development effort. We claim that it is possible
to automatically incrementalize static analyses, and incrementality can significantly improve the performance
of static analyses.
We design an incremental static analysis framework called IncA. IncA offers a Datalog-based language for
the specification of static analyses. The declarative nature of Datalog allows analysis developers to
focus on the analyses themselves by leaving the incrementalization to a Datalog solver. We design Datalog
solver algorithms to automatically incrementalize analyses and deliver the kind of performance that
analyses running in IDEs need. We prove our solver algorithms correct, making sure that analyses
incrementalized by IncA compute the exact same results as their non-incremental counterparts. The
architecture of IncA is generic, as it is independent of any particular subject language, program,
or analysis, enabling the integration of IncA into different IDEs.
We evaluate IncA by incrementalizing existing static analyses for a number of subject languages.
We implement well-formedness checks for DSLs, FindBugs rules, data-flow analyses, inter-procedural
points-to analysis for Java, plus a type checker for Featherweight Java and Rust. We find that IncA
consistently delivers good performance across all of these analyses on real-world programs. Based on
these results, we conclude that IncA enables the use of realistic static analyses for continuous feedback in IDEs.
2020
A Systematic Approach to Deriving Incremental Type Checkers
André Pacak, Sebastian Erdweg, and Tamás Szabó
In Proceedings of Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 2020.
[ pdf ]
Static typing can guide programmers if feedback is immediate. Therefore, all major IDEs incrementalize type
checking in some way. However, prior approaches to incremental type checking are often specialized and hard
to transfer to new type systems. In this paper, we propose a systematic approach for deriving incremental
type checkers from textbook-style type system specifications. Our approach is based on compiling inference
rules to Datalog, a carefully limited logic programming language for which incremental solvers exist. The
key contribution of this paper is to discover an encoding of the infinite typing relation as a finite Datalog
relation in a way that yields efficient incremental updates. We implemented the compiler as part of a type
system DSL and show that it supports simple types, some local type inference, operator overloading, universal
types, and iso-recursive types.
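The central encoding idea, typing as a finite relation over program nodes rather than an infinite relation over all terms, can be sketched with a naive fixed-point loop standing in for the Datalog solver. All names and the toy rule set below are assumptions made for illustration; this is not the paper's compiler:

```python
# Sketch (assumed encoding, not the paper's compiler): the infinite relation
# "term : type" becomes a finite relation type_of(node_id, type) over the
# nodes of one program, which a Datalog solver can maintain incrementally.

ast = {                       # hypothetical expression: (1 + 2) < 3
    "n1": ("lit_int",),
    "n2": ("lit_int",),
    "n3": ("add", "n1", "n2"),
    "n4": ("lit_int",),
    "n5": ("lt", "n3", "n4"),
}

def solve(ast):
    type_of = {}
    changed = True
    while changed:            # naive fixed point, stands in for a Datalog solver
        changed = False
        for node, shape in ast.items():
            if node in type_of:
                continue
            op, *kids = shape
            if op == "lit_int":
                ty = "int"
            elif op == "add" and all(type_of.get(k) == "int" for k in kids):
                ty = "int"
            elif op == "lt" and all(type_of.get(k) == "int" for k in kids):
                ty = "bool"
            else:
                continue      # premises not yet derivable; retry next round
            type_of[node] = ty
            changed = True
    return type_of

print(solve(ast)["n5"])  # prints: bool
```

An incremental Datalog solver then updates `type_of` tuples for exactly the nodes affected by an edit, instead of re-running the whole fixed point.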
2019
Incrementalizing Inter-procedural Program Analyses with Recursive Aggregation in Datalog
Tamás Szabó, Gábor Bergmann, and Sebastian Erdweg.
In Workshop on Incremental Computing (IC), 2019.
[ pdf ]
Our work efficiently incrementalizes inter-procedural analyses, so that they can be used for
interactive applications in IDEs. In prior work, we built the IncAL analysis framework,
which could already incrementalize intra-procedural lattice-based analyses. However, scaling to
inter-procedural analyses required an entirely new approach in the runtime system. We extended
IncAL into the new IncAD system and made use of differential dataflow to incrementalize
inter-procedural analyses. Our performance evaluation shows that IncAD delivers sub-second update
times for a real-world inter-procedural points-to analysis.
2018
Incrementalizing Lattice-Based Program Analyses in Datalog
Tamás Szabó, Gábor Bergmann, Sebastian Erdweg, and Markus Völter.
In Proceedings of Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 2018.
[ pdf ]
Program analyses detect errors in code, but when code changes frequently as in an IDE,
repeated re-analysis from scratch is unnecessary: It leads to poor performance unless
we give up on precision and recall. Incremental program analysis promises to deliver
fast feedback without giving up on precision or recall by deriving a new analysis
result from the previous one. However, Datalog and other existing frameworks for
incremental program analysis are limited in expressive power: They only support the
powerset lattice as representation of analysis results, whereas many practically
relevant analyses require custom lattices and aggregation over lattice values. To
this end, we present a novel algorithm called DRedL that supports incremental
maintenance of recursive lattice-value aggregation in Datalog. The key insight of
DRedL is to dynamically recognize increasing replacements of old lattice values by
new ones, which allows us to avoid the expensive deletion of the old value. We
integrate DRedL into the analysis framework IncA and use IncA to realize
incremental implementations of strong-update points-to analysis and string analysis
for Java. As our performance evaluation demonstrates, both analyses react to code
changes within milliseconds.
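The key insight, treating an increasing replacement of a lattice value as a cheap in-place update rather than a deletion followed by re-derivation, can be illustrated with a toy interval domain. This sketch is an illustrative assumption, not the actual DRedL algorithm; the class and the change log are invented here:

```python
# Sketch of the key DRedL idea (illustrative only, not the actual algorithm):
# when a fact's lattice value is replaced by a *larger* value, update it in
# place instead of taking the expensive delete-and-rederive path that a
# classic DRed-style scheme would use for every change.

def leq(a, b):                  # a ⊑ b on an interval domain: b covers a
    return b[0] <= a[0] and a[1] <= b[1]

class LatticeRelation:
    def __init__(self):
        self.values = {}        # key -> current lattice value
        self.log = []           # how each change was propagated

    def put(self, key, value):
        old = self.values.get(key)
        if old is None:
            self.log.append(("insert", key))
        elif leq(old, value):   # increasing replacement: cheap monotone update
            self.log.append(("monotone-update", key))
        else:                   # genuine retraction: expensive fallback path
            self.log.append(("delete-and-rederive", key))
        self.values[key] = value

rel = LatticeRelation()
rel.put("x", (0, 1))            # first derivation of x's interval
rel.put("x", (0, 5))            # interval grows: handled without deletion
rel.put("x", (2, 3))            # interval shrinks: needs the expensive path
print(rel.log)                  # prints: [('insert', 'x'), ('monotone-update', 'x'), ('delete-and-rederive', 'x')]
```

In an analysis that mostly moves up the lattice toward its fixed point, almost all updates take the cheap branch, which is where the reported millisecond reaction times come from.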
Incremental Overload Resolution in Object-Oriented Programming Languages
Tamás Szabó, Edlira Kuci, Matthijs Bijman, Mira Mezini, and Sebastian Erdweg.
In Proceedings of International Workshop on Formal Techniques for Java-like Programs (FTfJP). ACM, 2018.
[ pdf ]
Object-oriented programming languages feature static and
dynamic overloading: Multiple methods share the same name
but provide different implementations. Dynamic overloading
(also known as dynamic dispatch) is resolved at run time based
on the type of the receiver object. In this paper, we focus
on static overloading, which is resolved at compile time
based on the types of the method arguments.
The challenge this paper addresses is to incrementalize static
overload resolution in IDEs. IDEs resolve overloaded methods
for the developer to help them discern which implementation a
method call refers to. However, as the code changes, the IDE
has to reconsider previously resolved method calls when they
are affected by the code change. This paper clarifies when a
method call is affected by a code change and how to re-resolve
method calls with minimal computational effort. To this end,
we explore and compare two approaches to incremental type
checking: co-contextual type checking and IncA.
2017
An Overview of Program Analysis using Formal Methods
Markus Völter, Tamás Szabó, Björn Engelmann, and Klaus Birken.
An addendum to the book DSL Engineering (http://dslbook.org), 2017.
[ pdf ]
Static program analysis refers to determining properties of programs
without executing them, relying on a range of formal methods. While
these methods have been around for a long time, over the last couple
of years, some of these methods started to scale to solve problems
of interesting size. We have used advanced type systems, abstract
interpretation, SMT solving and model checking to answer relevant
questions about programs written with various DSLs. In this booklet
we introduce the methods, illustrate what we have done with them,
and describe how we have integrated the analysis method and existing
tools with languages and IDEs.
IncAL: A DSL for Incremental Program Analysis with Lattices
Tamás Szabó, Sebastian Erdweg, and Markus Völter.
In Workshop on Incremental Computing (IC), 2017.
[ pdf ]
We describe IncAL, a DSL for incremental lattice-based
program analyses. IncAL is an extension of our previous work,
IncA, which supported relational program analyses and has
been used for practically relevant analyses on industrial code
bases. IncAL improves the expressive power of IncA by
adding support for synthesis of data, enabling, for example,
incremental execution of interval analysis.
2016
Lessons Learned from Developing mbeddr: A Case Study in Language Engineering with MPS
Markus Völter, Bernd Kolb, Tamás Szabó, Daniel Ratiu, and Arie van Deursen.
In Software and Systems Modeling, 2016.
[ pdf ]
Language workbenches are touted as a promising technology to engineer languages for use in a wide range
of domains, from programming to science to business. However, not many real-world case studies exist
that evaluate the suitability of language workbench technology for this task. This paper contains such a
case study.
In particular, we evaluate the development of mbeddr, a collection of integrated languages and
language extensions built with the Jetbrains MPS language workbench. mbeddr consists of 81 languages,
together with their IDE support, 34 of which are C extensions. The mbeddr languages use a wide variety of
notations – textual, tabular, symbolic and graphical – and the C extensions are modular; new extensions
can be added without changing the existing implementation of C. mbeddr’s development has spanned 10 person
years so far, and the tool is used in practice and continues to be developed. This makes mbeddr a
meaningful case study of non-trivial size and complexity.
The evaluation is centered around five research questions: language modularity, notational freedom and
projectional editing, mechanisms for managing complexity, performance and scalability issues, and the
consequences for the development process.
Dedicated Support for Analyses and Optimizations in Language Workbenches
Tamás Szabó.
In Proceedings of Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH) - Doctoral Symposium, ACM, 2016.
[ pdf ]
Language workbenches are widely used to implement domain-specific languages (DSLs)
and their accompanying integrated development environments (IDEs). They help to define
the abstract syntax, concrete syntax(es), type system, and transformations for the languages.
However, there are other language aspects, specifically program analyses and optimizations,
that are also crucial to a language implementation, but state-of-the-art language workbenches
have only limited support for them. The high implementation effort for these language aspects
is justifiable for a general-purpose language (GPL), but not for DSLs because
of their different development economies.
To this end, I conduct research on dedicated support for analyses and optimizations for DSLs
in language workbenches. My main goal is to develop declarative meta-languages that help to
define static program analyses and that capture and automate patterns and techniques of optimizations.
The research directions are directly driven by industrial need, and upon successful completion,
the results would be applied in projects centered around DSLs for high-performance computing (HPC),
insurance, and concurrent embedded systems.
Efficient Development of Consistent Projectional Editors using Grammar Cells
Markus Völter, Tamás Szabó, Sascha Lisson, Bernd Kolb, Sebastian Erdweg, and Thorsten Berger.
In Proceedings of Software Language Engineering (SLE), ACM, 2016.
[ pdf ]
The definition of a projectional editor does not just specify the notation of a
language, but also how users interact with the notation. Because of this, it is
easy to end up with different interaction styles within one language and between multiple
languages. The resulting inconsistencies have proven to be a major usability problem.
To address this problem, we introduce grammar cells, an approach for declaratively
specifying textual notations and their interactions for projectional editors.
In the paper we motivate the problem, give a formal definition of grammar cells,
and define their mapping to low-level editor behaviors. Our evaluation based
on project experience shows that grammar cells improve editing experience by
providing a consistent and intuitive, “text editor-like” user experience for
textual notations. At the same time they do not limit language composability
and the use of non-textual notations, the primary benefits of projectional
editors. We have implemented grammar cells for Jetbrains MPS, but they can also
be used with other projectional editors.
IncA: A DSL for the Definition of Incremental Program Analyses
Tamás Szabó, Sebastian Erdweg, and Markus Voelter.
In Proceedings of International Conference on Automated Software Engineering (ASE), ACM, 2016.
[ pdf ]
Program analyses support software developers, for example, through error
detection and code-quality assurance, and by enabling compiler optimizations
and refactorings. To provide real-time feedback to developers within IDEs,
an analysis must run efficiently even if the analyzed code base is large.
To achieve this goal, we present a domain-specific language called IncA for
the definition of efficient incremental program analyses that update their
result as the program changes. IncA compiles analyses into graph patterns
and relies on existing incremental matching algorithms. To scale IncA analyses
to large programs, we describe optimizations that reduce caching and prune
change propagation. Using IncA, we have developed incremental control flow
and points-to analysis for C, well-formedness checks for DSLs, and 10 FindBugs
checks for Java. Our evaluation demonstrates significant speedups for all analyses
compared to their non-incremental counterparts.
An Extensible Framework for Variable-Precision Data-Flow Analyses in MPS
Tamás Szabó, Simon Alperovich, Markus Voelter, and Sebastian Erdweg.
In Proceedings of International Conference on Automated Software Engineering (ASE) - Tool Track, ACM, 2016.
[ pdf ]
Data-flow analyses are used as part of many software engineering
tasks: they are the foundations of program understanding, refactorings
and optimized code generation. Similar to general-purpose languages
(GPLs), state-of-the-art domain-specific languages (DSLs) also require
sophisticated data-flow analyses. However, as a consequence of the
different economies of DSL development and their typically relatively
fast evolution, the effort for developing and evolving such analyses
must be lowered compared to GPLs. This tension can be resolved with
dedicated support for data-flow analyses in language workbenches.
In this tool paper we present MPS-DF, which is the component in the MPS
language workbench that supports the definition of data-flow analyses
for DSLs. Language developers can define data-flow graph builders
declaratively as part of a language definition and compute analysis
results efficiently based on these data-flow graphs. MPS-DF is
extensible such that it does not compromise the support for language
composition in MPS. Additionally, clients of MPS-DF analyses can run the
analyses with variable precision thus trading off precision for
performance. This allows clients to tailor an analysis to a particular
use case.
2014
mbeddr: Extensible Languages for Embedded Software Development
Tamás Szabó, Markus Voelter, Bernd Kolb, Daniel Ratiu, and Bernhard Schaetz.
In Proceedings of High Integrity Language Technology (HILT), ACM, 2014.
[ pdf ]
In this industrial presentation we will demonstrate mbeddr,
an extensible set of integrated languages for embedded software
development. After discussing the context of the talk,
we will give details about the mbeddr architecture, which
relies on the MPS language workbench. Then, we will elaborate
on the extension modules and show how they fit with
safety-critical development processes. Finally, we will point
out how the existing languages can be extended by the user
by giving some real-world examples, including a language
construct that could have prevented the Apple “goto fail”
bug as well as mathematical notations.
2012
Developing and Visualizing Live Model Queries
Zoltán Újhelyi, Tamás Szabó, István Ráth, and Dániel Varró.
In Proceedings of Analysis of Model Transformations (AMT), ACM, 2012.
[ pdf ]
Several important tasks performed by model-driven development tools —
such as well-formedness constraint validation or model transformations —
rely on evaluating model queries. If the model changes rapidly or frequently,
it is beneficial to provide live queries that automatically propagate
these model changes into the query results. To ease the development and
debugging of live queries, the development environment should provide a
way to evaluate the query results continuously, helping to understand how the created query works.
This paper presents a generic live model query visualizer that displays
and updates the query results depending on their source models. It has
been implemented for the EMF-IncQuery framework and presented here
for validating BPMN models.
Incremental Pattern Matching for the Efficient Computation of Transitive Closure
Gábor Bergmann, István Ráth, Tamás Szabó, Paolo Torrini, and Dániel Varró.
In Proceedings of International Conference on Graph Transformations (ICGT), Springer-Verlag, 2012.
[ pdf ]
Pattern matching plays a central role in graph transformations as a key
technology for computing local contexts in which transformation rules are to be applied.
Incremental matching techniques offer a performance advantage over the search-based approach,
in a number of scenarios including on-the-fly model synchronization, model simulation,
view maintenance, well-formedness checking and state space traversal. However, the incremental
computation of transitive closure in graph pattern matching has started to be investigated
only recently. In this paper, we propose multiple algorithms for the efficient computation of
generalized transitive closures. As such, our solutions are capable of computing reachability
regions defined by simple graph edges as well as complex binary relationships defined by
graph patterns, which may be used in a wide spectrum of modeling problems. We also report on
experimental evaluation of our prototypical implementation, carried out within the context
of a stochastic system simulation case study.
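The basic shape of incrementally maintaining a transitive closure can be sketched for the simplest case, edge insertion: when an edge (u, v) arrives, only the reachability pairs that the new edge enables are added. This is an illustrative sketch under that simplification, not one of the paper's algorithms, which also handle deletions and closures over relations defined by graph patterns:

```python
# Simple sketch of incrementally maintaining a transitive closure under edge
# insertions (illustrative only; deletions and pattern-defined relations,
# which the paper also covers, are omitted).

class IncrementalClosure:
    def __init__(self):
        self.reach = set()              # pairs (a, b) with a path a ->* b

    def insert_edge(self, u, v):
        # Only pairs enabled by the new edge are computed:
        new = {(u, v)}
        new |= {(a, v) for (a, b) in self.reach if b == u}   # ... -> u -> v
        new |= {(u, d) for (c, d) in self.reach if c == v}   # u -> v -> ...
        new |= {(a, d) for (a, b) in self.reach if b == u
                       for (c, d) in self.reach if c == v}   # ... -> u -> v -> ...
        self.reach |= new

tc = IncrementalClosure()
tc.insert_edge("a", "b")
tc.insert_edge("b", "c")
print(sorted(tc.reach))  # prints: [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Deletion is the genuinely hard case, since a removed edge may or may not break reachability depending on alternative paths, which is why incremental transitive closure remained an open engineering problem.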
2011
Parallel Saturation Based Model Checking
András Vörös, Tamás Szabó, Attila Jámbor, Dániel Darvas, Ákos Horváth, Tamás Bartha.
In Proceedings of International Symposium on Parallel and Distributed Computing (ISPDC), IEEE, 2011.
[ pdf ]
Formal verification is becoming a fundamental step
of safety-critical and model-based software development. As
part of the verification process, model checking is one of the
current advanced techniques to analyze the behavior of a
system. In this paper, we examine an existing parallel model
checking algorithm and we propose improvements to eliminate
some computational bottlenecks. Our measurements show that
the resulting new algorithm has better scalability and
performance than both the former parallel approach and the
sequential algorithm.