Interactive C++ for Data Science
In our previous blog post, “Interactive C++ with Cling”, we mentioned that exploratory programming is an effective way to reduce the complexity of a problem. This post discusses some applications of Cling developed to support data science researchers. In particular, interactively probing data and interfaces makes complex libraries and complex data more accessible to users. We aim to demonstrate some of Cling’s features at scale, Cling’s eval-style programming support, projects related to Cling, and interactive C++/CUDA.
Eval-style programming
A Cling instance can access itself through its runtime. The example below creates a cling::Value to store the execution result of the incremented variable i. That mechanism can be used further to support dynamic scopes, extending the name lookup at runtime.
[cling]$ #include <cling/Interpreter/Value.h>
[cling]$ #include <cling/Interpreter/Interpreter.h>
[cling]$ int i = 1;
[cling]$ cling::Value V;
[cling]$ gCling->evaluate("++i", V);
[cling]$ i
(int) 2
[cling]$ V
(cling::Value &) boxes [(int) 2]
V “boxes” the expression result, providing extended lifetime if necessary. The cling::Value can be used to communicate expression values from the interpreter to compiled code.
[cling]$ ++i
(int) 3
[cling]$ V
(cling::Value &) boxes [(int) 2]
This mechanism introduces evaluation that is delayed until runtime, which enables features that increase the dynamic look and feel of the C++ language.
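As a rough sketch of the compiled-code side of this exchange, the snippet below evaluates an expression through a cling::Interpreter reference and unboxes the result from the cling::Value; the function name and the getLL() accessor used for the integral result are illustrative assumptions, not code taken from the transcript above.

#include <cling/Interpreter/Interpreter.h>
#include <cling/Interpreter/Value.h>

// Sketch only: evaluate an expression in the interpreter and read the boxed
// result back in compiled code. `interp` plays the role of gCling above;
// getLL() is assumed to return the boxed integral value.
long long readFromInterpreter(cling::Interpreter& interp) {
  cling::Value result;
  interp.evaluate("++i", result); // the same call gCling->evaluate() performs
  return result.getLL();          // unbox the integral result for compiled code
}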
The ROOT data analysis package
The main tool for storage, analysis, and visualization of scientific data in the field of high energy physics (HEP) is the specialized software package ROOT. ROOT is a set of interconnected components that assist scientists all the way from data storage and analysis to the visualizations published in scientific papers. ROOT has played a significant role in scientific discoveries such as the detection of gravitational waves, the great cavity in the Pyramid of Cheops, and the discovery of the Higgs boson at the Large Hadron Collider. For the last 5 years, Cling has helped to analyze 1 EB of physics data, served as a basis for over 1000 scientific publications, and supported software running across a distributed computing facility with a million CPU cores.
ROOT uses Cling as a reflection information service for data serialization. C++ objects are stored in a binary format, vertically (column-wise). The contents of a loaded data file are made available to users, and the stored C++ objects become first-class citizens.
A central component of ROOT enabled by Cling is eval-style programming. We use this in HEP to make it easy to inspect and use C++ objects stored by ROOT. Cling enables ROOT to inject available object names into the name lookup when a file is opened:
[root] ntuple->GetTitle()
error: use of undeclared identifier 'ntuple'
[root] TFile::Open("tutorials/hsimple.root"); ntuple->GetTitle() // #1
(const char *) "Demo ntuple"
[root] gFile->ls();
TFile** tutorials/hsimple.root Demo ROOT file with histograms
TFile* tutorials/hsimple.root Demo ROOT file with histograms
OBJ: TH1F hpx This is the px distribution : 0 at: 0x7fadbb84e390
OBJ: TNtuple ntuple Demo ntuple : 0 at: 0x7fadbb93a890
KEY: TH1F hpx;1 This is the px distribution
[...]
KEY: TNtuple ntuple;1 Demo ntuple
[root] hpx->Draw()
The ROOT framework injects additional names into the name lookup in two stages. First, it builds an invalid AST by marking the occurrence of ntuple (#1), which is then transformed into

gCling->EvaluateT</*return type*/void>("ntuple->GetTitle()", /*context*/);

In the next stage, at runtime, ROOT opens the file, reads its preamble, and injects the names via the external name lookup facility in Clang. The transformation becomes more complex if ntuple->GetTitle() takes arguments.
Figure 1. Interactive plot of the px distribution read from a ROOT file.
C++ in Notebooks
Section Author: Sylvain Corlay, QuantStack
The Jupyter Notebook technology allows users to create and share documents that contain live code, equations, visualizations, and narrative text. It enables data scientists to easily exchange ideas or collaborate by sharing their analyses in a straightforward and reproducible way. Language agnosticism is a key design principle of the Jupyter project, and the Jupyter frontend communicates with the kernel (the part of the infrastructure that runs the code) through a well-specified protocol. Kernels have been developed for dozens of programming languages, such as R, Julia, Python, and Fortran (through the LLVM-based LFortran project).
Jupyter’s official C++ kernel relies on Xeus, a C++ implementation of the kernel protocol, and Cling. An advantage of using a reference implementation for the kernel protocol is that a lot of features come for free, such as rich mime type display, interactive widgets, auto-complete, and much more.
Rich mime-type rendering for user-defined types can be specified by providing an overload of mime_bundle_repr for the said type, which is picked up by argument-dependent lookup.
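As a minimal sketch of what such an overload can look like (closely following the pattern shown in the xeus-cling documentation; the im::image type and its members are illustrative), the overload returns a JSON mime bundle keyed by mime type:

#include <fstream>
#include <sstream>
#include <string>
#include "nlohmann/json.hpp"
#include "xtl/xbase64.hpp"

namespace im
{
    // Illustrative user-defined type holding the raw bytes of a PNG file.
    struct image
    {
        explicit image(const std::string& filename)
        {
            std::ifstream fin(filename, std::ios::binary);
            m_buffer << fin.rdbuf();
        }
        std::stringstream m_buffer;
    };

    // Found via argument-dependent lookup; the kernel forwards the returned
    // mime bundle to the Jupyter frontend for rendering.
    nlohmann::json mime_bundle_repr(const image& i)
    {
        auto bundle = nlohmann::json::object();
        bundle["image/png"] = xtl::base64encode(i.m_buffer.str());
        return bundle;
    }
}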
Figure 2. Inline rendering of images in JupyterLab for a user-defined image type.
Possibilities with rich mime type rendering are endless, such as rich display of dataframes with HTML tables, or even mime types that are rendered in the front-end with JavaScript extensions.
An advanced example making use of rich rendering with MathJax is the SymEngine symbolic computing library.
Figure 3. Using rich mime type rendering in Jupyter with the SymEngine package.
Xeus-cling comes along with an implementation of the Jupyter widgets protocol, which enables bidirectional communication with the backend.
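For instance, with the xwidgets library a slider created in a notebook cell is rendered in the frontend, and its value can be both read and updated from C++ (a small sketch based on our understanding of the xwidgets API):

#include "xwidgets/xslider.hpp"

xw::slider<double> slider;   // create a slider widget
slider.display();            // render the widget in the cell output area
slider.value = 20.;          // updating from C++ moves the slider in the frontend
double v = slider.value();   // reading reflects changes made by the user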
Figure 4. Interactive widgets in the JupyterLab with the C++ kernel.
More complex widget libraries have been enabled through this framework like xleaflet.
Figure 5. Interactive GIS in C++ in JupyterLab with xleaflet.
Other features include rich HTML help for the standard library and third-party packages:
Figure 6. Accessing cppreference for std::vector from JupyterLab by typing `?std::vector`.
The Xeus and Xeus-cling kernels were recently incorporated as subprojects to Jupyter, and are governed by its code of conduct and general governance.
Planned future developments for the xeus-cling kernel include: adding support for the Jupyter console interface through an implementation of the Jupyter is_complete message, which is currently lacking; adding support for Cling “dot commands” as Jupyter magics; and supporting the new debugger protocol that was recently added to the Jupyter kernel protocol, which will enable the use of the JupyterLab visual debugger with the C++ kernel.
Another tool that brings interactive plotting features to xeus-cling is xvega, which, although at an early stage of development, produces Vega charts that can be displayed in the notebook.
Figure 7. The xvega plotting library in the xeus-cling kernel.
CUDA C++
Section Author: Simeon Ehrig, HZDR
The Cling CUDA extension brings the workflows of interactive C++ to GPUs without losing performance or compatibility with existing software. To execute CUDA C++ code, Cling activates an extension in the compiler frontend to understand the CUDA C++ dialect and creates a second compiler instance that compiles the code for the GPU.
Figure 8. CUDA/C++ information flow in Cling.
Like the normal C++ mode, the CUDA C++ mode uses AST transformations to enable interactive CUDA C++ and special features such as the Cling print system. In contrast to the normal Cling compiler pipeline used for the host code, the device compiler pipeline does not use all of the host pipeline’s transformations; instead, it applies some transformations specific to device code.
[cling] #include <iostream>
[cling] #include <cublas_v2.h>
[cling] #pragma cling load("libcublas.so") // link a shared library
// set parameters
// allocate memory
// ...
[cling] __global__ void init(float *matrix, int size){
[cling] ? int x = blockIdx.x * blockDim.x + threadIdx.x;
[cling] ? if (x < size)
[cling] ? matrix[x] = x;
[cling] ? }
[cling]
[cling] // launching a function direct in the global space
[cling] init<<<blocks, threads>>>(d_A, dim*dim);
[cling] init<<<blocks, threads>>>(d_B, dim*dim);
[cling]
[cling] cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, d_A, dim, d_B, dim, &beta, d_C, dim);
[cling] cublasGetVector(dim*dim, sizeof(h_C[0]), d_C, 1, h_C, 1);
[cling] cudaGetLastError()
(cudaError_t) (cudaError::cudaSuccess) : (unsigned int) 0
Like the normal C++ mode, the CUDA mode can be used in a Jupyter Notebook.
Figure 9. Cling’s CUDA mode used in a Jupyter notebook.
A special property of Cling in CUDA mode is that the Cling application becomes a normal CUDA application at the time of the first CUDA API call. This enables the use of the CUDA SDK with Cling. For example, you can profile your interactive application with the CUDA profiler: nvprof ./cling -xcuda. This docker container can be used to experiment with Cling’s CUDA mode.
Planned future developments for the CUDA mode include: supporting the complete current CUDA API; redefining CUDA kernels; and supporting other GPU SDKs such as HIP (AMD) and SYCL (Intel).
Conclusion
We see interactive C++ as an important tool for researchers in the data science community. Cling has enabled ROOT to become the “go-to” data analysis tool in the field of high energy physics, for everything from efficient I/O to plotting and fitting. The interactive CUDA backend allows easy integration of research workflows and simpler communication between C++ and CUDA. As Jupyter notebooks have become a standard way for data analysts to explore ideas, xeus-cling ensures that great interactive C++ ingredients are available in every C++ notebook.
In the next blog post we will focus on Cling enabling features beyond interactive C++, and in particular language interoperability.
Acknowledgements
The author would like to thank Sylvain Corlay, Simeon Ehrig, David Lange, Chris Lattner, Javier Lopez Gomez, Wim Lavrijsen, Axel Naumann, Alexander Penev, Xavier Valls Pla, Richard Smith, Martin Vassilev, who contributed to this post.
You can find out more about our activities at https://root.cern/cling/ and https://compiler-research.org.