Incrementalizing Production CodeQL Analyses
Tamás Szabó
In Foundations of Software Engineering (FSE), Industry Track. ACM, 2023.
[ pdf ]
Instead of repeatedly re-analyzing from scratch, an incremental static analysis only analyzes a codebase once completely, and then it updates the previous results based on the code changes. While this sounds promising to achieve speed-ups, the reality is that sophisticated static analyses typically employ features that can ruin incremental performance, such as inter-procedurality or context-sensitivity. In this study, we set out to explore whether incrementalization can help to achieve speed-ups for production CodeQL analyses that provide automated feedback on pull requests on GitHub. We first empirically validate the idea by measuring the potential for reuse on real-world codebases, and then we create a prototype incremental solver for CodeQL that exploits incrementality. We report on experimental results showing that we can indeed achieve update times proportional to the size of the code change, and we also discuss the limitations of our prototype.
Concise, Type-Safe, and Efficient Structural Diffing
Sebastian Erdweg, Tamás Szabó, and André Pacak
In Programming Language Design and Implementation (PLDI). ACM, 2021.
[ pdf ]
A structural diffing algorithm compares two pieces of tree-shaped data and computes their difference. Existing structural diffing algorithms either produce concise patches or ensure type safety, but never both. We present a new structural diffing algorithm called truediff that achieves both properties by treating subtrees as mutable, yet linearly typed resources. Mutation is required to derive concise patches that only mention changed nodes, but, in contrast to prior work, truediff guarantees all intermediate trees are well-typed. We formalize type safety, prove truediff has linear run time, and evaluate its performance and the conciseness of the derived patches empirically for real-world Python documents. While truediff ensures type safety, the size of its patches is on par with Gumtree, a popular untyped diffing implementation. Regardless, truediff outperforms Gumtree and a typed diffing implementation by an order of magnitude.
Incremental Whole-Program Analysis in Datalog with Lattices
Tamás Szabó, Sebastian Erdweg, and Gábor Bergmann
In Programming Language Design and Implementation (PLDI). ACM, 2021.
[ pdf ]
Incremental static analyses provide up-to-date analysis results in time proportional to the size of a code change, not the entire code base. This promises fast feedback to programmers in IDEs and when checking in commits. However, existing incremental analysis frameworks fail to deliver on this promise for whole-program lattice-based data-flow analyses. In particular, prior Datalog-based frameworks yield good incremental performance only for intra-procedural analyses.
In this paper, we first present a methodology to empirically test if a computation is amenable to incrementalization. Using this methodology, we find that incremental whole-program analysis may be possible. Second, we present a new incremental Datalog solver called LADDDER to eliminate the shortcomings of prior Datalog-based analysis frameworks. Our Datalog solver uses a non-standard aggregation semantics which allows us to loosen monotonicity requirements on analyses and to improve the performance of lattice aggregators considerably. Our evaluation on real-world Java code confirms that LADDDER provides up-to-date points-to, constant propagation, and interval information in milliseconds.
Incrementalizing Static Analyses in Datalog
Tamás Szabó
Doctoral dissertation, JGU Mainz, 2021.
[ pdf ]
Static analyses are tools that reason about the behavior of computer programs without actually executing them. They play an important role in many areas of software development because they can catch potential runtime errors or reason about the security or performance of subject programs already at development time. For example, Integrated Development Environments (IDEs) use a variety of static analyses to provide continuous feedback to developers, as they edit their programs. IDEs run type checkers to verify that subject programs are free of type inconsistencies, or they run data-flow analyses to find security vulnerabilities. In turn, developers can fix or improve their programs before those go into production, thereby saving significant costs on the harmful effects of failures.
Designing static analyses that provide continuous feedback in IDEs is challenging because analyses must balance a vexing trade-off between precision and performance. On the one hand, an analysis must precisely capture the program behavior by considering all possible concrete executions. On the other hand, an analysis running in an IDE must update its results after a program change in a fraction of a second, otherwise it interrupts the development flow, which is counterproductive. However, these two requirements are in conflict with each other, as more precision comes with more computational cost. Given that we cannot loosen the timing constraint in an IDE, the challenge is how to speed up static analyses, so that we sacrifice on precision as little as possible.
We study how to use incrementality to speed up static analyses. Instead of repeatedly reanalyzing the entire subject program from scratch, an incremental analysis reuses previous results and updates them based on the changed code parts. An incremental analysis can achieve significant performance improvements over its non-incremental counterpart if the computational cost of updating analysis results is proportional to the size of the change and not the size of the entire program. However, precise static analyses typically use features that significantly increase computational costs by requiring the re-analysis of more than just the changed code parts. Moreover, incrementalization requires sophisticated algorithms and data structures, so specialized solutions often do not pay off in terms of development effort. We claim that it is possible to automatically incrementalize static analyses, and incrementality can significantly improve the performance of static analyses.
We design an incremental static analysis framework called IncA. IncA offers a Datalog-based language for the specification of static analyses. The declarative nature of Datalog allows analysis developers to focus on the analyses themselves by leaving the incrementalization to a Datalog solver. We design Datalog solver algorithms to automatically incrementalize analyses and deliver the kind of performance that analyses running in IDEs need. We prove our solver algorithms correct, making sure that analyses incrementalized by IncA compute the exact same results as their non-incremental counterparts. The architecture of IncA is generic, as it is independent of any particular subject language, program, or analysis, enabling the integration of IncA into different IDEs.
We evaluate IncA by incrementalizing existing static analyses for a number of subject languages. We implement well-formedness checks for DSLs, FindBugs rules, data-flow analyses, inter-procedural points-to analysis for Java, plus a type checker for Featherweight Java and Rust. We find that IncA consistently delivers good performance across all of these analyses on real-world programs. Based on these results, we conclude that IncA enables the use of realistic static analyses for continuous feedback in IDEs.
A Systematic Approach to Deriving Incremental Type Checkers
André Pacak, Sebastian Erdweg, and Tamás Szabó
In Proceedings of Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 2020.
[ pdf ]
Static typing can guide programmers if feedback is immediate. Therefore, all major IDEs incrementalize type checking in some way. However, prior approaches to incremental type checking are often specialized and hard to transfer to new type systems. In this paper, we propose a systematic approach for deriving incremental type checkers from textbook-style type system specifications. Our approach is based on compiling inference rules to Datalog, a carefully limited logic programming language for which incremental solvers exist. The key contribution of this paper is to discover an encoding of the infinite typing relation as a finite Datalog relation in a way that yields efficient incremental updates. We implemented the compiler as part of a type system DSL and show that it supports simple types, some local type inference, operator overloading, universal types, and iso-recursive types.
Incrementalizing Inter-procedural Program Analyses with Recursive Aggregation in Datalog
Tamás Szabó, Gábor Bergmann, and Sebastian Erdweg.
In Workshop on Incremental Computing (IC), 2019.
[ pdf ]
Our work efficiently incrementalizes inter-procedural analyses, so that they can be used for interactive applications in IDEs. In prior work, we built the IncAL analysis framework, which could already incrementalize intra-procedural lattice-based analyses. However, scaling to inter-procedural analyses required an entirely new approach in the runtime system. We extended IncAL into the new IncAD system and made use of differential dataflow to incrementalize inter-procedural analyses. Our performance evaluation shows that IncAL delivers sub-second update times for a real-world inter-procedural points-to analysis.
Incrementalizing Lattice-Based Program Analyses in Datalog
Tamás Szabó, Gábor Bergmann, Sebastian Erdweg, and Markus Völter.
In Proceedings of Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 2018.
[ pdf ]
Program analyses detect errors in code, but when code changes frequently as in an IDE, repeated re-analysis from-scratch is unnecessary: It leads to poor performance unless we give up on precision and recall. Incremental program analysis promises to deliver fast feedback without giving up on precision or recall by deriving a new analysis result from the previous one. However, Datalog and other existing frameworks for incremental program analysis are limited in expressive power: They only support the powerset lattice as representation of analysis results, whereas many practically relevant analyses require custom lattices and aggregation over lattice values. To this end, we present a novel algorithm called DRedL that supports incremental maintenance of recursive lattice-value aggregation in Datalog. The key insight of DRedL is to dynamically recognize increasing replacements of old lattice values by new ones, which allows us to avoid the expensive deletion of the old value. We integrate DRedL into the analysis framework IncA and use IncA to realize incremental implementations of strong-update points-to analysis and string analysis for Java. As our performance evaluation demonstrates, both analyses react to code changes within milliseconds.
Incremental Overload Resolution in Object-Oriented Programming Languages
Tamás Szabó, Edlira Kuci, Matthijs Bijman, Mira Mezini, and Sebastian Erdweg.
In Proceedings of International Workshop on Formal Techniques for Java-like Programs (FTfJP). ACM, 2018.
[ pdf ]
Object-oriented programming languages feature static and dynamic overloading: Multiple methods share the same name but provide different implementations. Dynamic overloading (also know as dynamic dispatch) is resolved at run time based on the type of the receiver object. In this paper, we focus on static overloading, which is resolved at compile time based on the types of the method arguments. The challenge this paper addresses is to incrementalize static overload resolution in IDEs. IDEs resolve overloaded methods for the developer to help them discern which implementation a method call refers to. However, as the code changes, the IDE has to reconsider previously resolved method calls when they are affected by the code change. This paper clarifies when a method call is affected by a code change and how to re-resolve method calls with minimal computational effort. To this end, we explore and compare two approaches to incremental type checking: co-contextual type checking and IncA.
An Overview of Program Analysis using Formal Methods
Markus Völter, Tamás Szabó, Björn Engelmann, and Klaus Birken.
An addendum to the book DSL Engineering (http://dslbook.org), 2017.
[ pdf ]
Static program analysis refers to determining properties of programs without executing it, relying on a range of formal methods. While these methods have been around for a long time, over the last couple of years, some of these methods started to scale to solve problems of interesting size. We have used advanced type systems, abstract interpretation, SMT solving and model checking to answer relevant questions about programs written with various DSLs. In this booklet we introduce the methods, illustrate what we have done with them, and describe how we have integrated the analysis method and existing tools with languages and IDEs.
IncAL: A DSL for Incremental Program Analysis with Lattices
Tamás Szabó, Sebastian Erdweg, and Markus Völter.
In Workshop on Incremental Computing (IC), 2017.
[ pdf ]
We describe IncAL, a DSL for incremental lattice-based program analyses. IncAL is an extension of our previous work, IncA, which supported relational program analyses, that has been used for practically relevant analyses on industrial code bases. IncAL improves the expressive power of IncA by adding support for synthesis of data, enabling, for example, incremental execution of interval analysis.