Over the years, we have heard from customers, particularly those running GoodData in-house on their own premises, that they want to integrate data from various internal APIs into their reports and dashboards in real time.
These customers often have critical data in legacy and proprietary systems, or systems that do not easily fit into traditional BI solutions: NoSQL, APIs, Machine Learning models, and more.
The importance of integrating this internal data cannot be overstated. Real-time insights from these data sources are often essential for making informed decisions, optimizing operations, and maintaining a competitive edge. Whether it is up-to-the-minute sales figures, operational metrics, or customer engagement data, having all the relevant information available in one place empowers businesses to react swiftly.
Faced with this challenge, we set out to find a solution that would allow our customers to easily integrate data from any source, regardless of format or protocol, into the GoodData platform. Our goal was to create a flexible, robust system that could handle diverse data inputs without sacrificing performance or requiring extensive re-engineering of the existing infrastructure.
In this article, I will explain why and how we embarked on this journey, and the technological and architectural decisions we made to address the need for custom data source integration. Specifically, we will look into how we leveraged our existing architecture (based on Apache Arrow and Flight RPC) to build a solution that meets these complex requirements, the challenges we faced, the solutions we came up with, and how we extended our FlexQuery subsystem to accommodate this new functionality.
This article is part of the FlexConnect Launch Series. Don't miss our other articles on topics such as API ingestion, NoSQL integration, Kafka connectivity, Machine Learning, and Unity Catalog!
FlexQuery Architecture Overview
Before I describe how we extended GoodData to accommodate custom data sources, I will give you a brief overview of our architecture. We have a couple of in-depth posts about this if you are interested, but here I will provide just a quick summary.
GoodData computes analytics using a semantic model. Engineers create semantic models that map between the logical world (datasets, attributes, facts, metrics) and the physical world (a data source, typically an RDBMS or data warehouse, with tables and views).
GoodData users create reports that work with the logical entities, and our system then translates these reports into physical plans. So, for example, when the data source used by the logical model is a data warehouse, GoodData will build a SQL query to obtain the data, perform additional transformations (such as reshaping, sorting, and adding totals), and then serve the result via an API so that it can be rendered in a visualization.
We have encapsulated all these concepts of the GoodData platform into a subsystem that we call FlexQuery. This subsystem works closely with the semantic model and:
- Creates a physical plan
- Orchestrates execution of the physical plan
- Connects to data sources to obtain raw data
- Transparently caches results and/or intermediate results for fast and efficient reuse
It also performs different kinds of post-processing operations; broadly, there are two main types:
- Dataframe operations, where the data is wrangled or enriched using Pandas/Polars
- Additional SQL operations, where the data is post-processed using SQL in our local, dynamically created DuckDB instances
The following diagram provides a high-level outline of the architecture and its interactions. Note that both the intermediate and final computation results are cached for reuse. If the system already finds the data in caches, it will short-circuit all the complex processing and transparently work with the cached data.
FlexQuery is built entirely on top of Apache Arrow. We have adopted both the format and the Arrow Flight RPC as the API for the data services. We took great care to design our use of Flight RPC so that the layer of our data services allows for data processing composability.
In other words, because all the data in FlexQuery is in Arrow format and all the data services implement the same API (the Flight RPC), we can easily combine and recombine the existing data services to address new requirements and use cases.
Take, for example, the data service that performs SQL post-processing using DuckDB. This data service is important; it will play a big role later in the article. It is designed to fulfill the following contract: generate a new Flight by running SQL on top of data from one or more other Flights. The service obtains Arrow data for the input Flights (described by their Flight descriptors), loads the data into DuckDB, and runs arbitrary SQL. The result is then either streamed out for one-time reading or transparently cached (as a flight path) for repeated reads.
In the end, the service performing the SQL post-processing does not care where the data comes from; the only thing that matters is that it is in Arrow format. The data may come from caches, an RDBMS, a data warehouse, files on object storage, or an arbitrary non-SQL data source. It may even be the case that the data for each table comes from a different data source.
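To make the contract concrete, here is a minimal sketch of the underlying idea, assuming plain in-process DuckDB and PyArrow (the real service is a Flight RPC server, and the table name here is illustrative):

```python
import duckdb
import pyarrow as pa

# Stand-in for Arrow data obtained for an input Flight; in the real service
# it may come from caches, an RDBMS, a warehouse, or object storage.
flight_data = pa.table({"region": ["EU", "US", "EU"], "amount": [10.0, 25.5, 7.25]})

con = duckdb.connect()
con.register("input_flight", flight_data)  # expose the Arrow data as a table

# Run arbitrary SQL on top of it; the result is Arrow again, ready to be
# streamed out for one-time reading or cached for repeated reads.
result: pa.Table = con.execute(
    "SELECT region, SUM(amount) AS total FROM input_flight GROUP BY region"
).arrow()
```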
With such flexibility available, we can quickly create different kinds of data processing flows to address different product use cases. In the next section, I will describe how we leveraged this to enable custom data sources.
Custom Data Sources With Arrow and Flight RPC
When we started designing the capability to build and integrate custom data sources with GoodData, we worked with several main requirements:
Simple Integration Contract
The integration contract must be simple and, importantly, must not impose complicated implementation requirements on the custom data source.
For example, establishing a contract where the custom data source receives requests as SQL would not be a good idea, purely because it forces the implementation to understand and act on SQL. The custom data sources may encapsulate complex algorithms, or bridge to NoSQL database systems or even to entirely different APIs.
Seamless Data Source Integration
The custom data source must integrate seamlessly with our existing semantic model, APIs, and the applications we have already built.
For example, we have a visual modeler application that allows users to discover the datasets available in a data source and then add them to the semantic model.
Ready for Complex Metrics
Users must be able to build complex metrics on top of the data provided by the custom data source; the data source itself must be shielded from this. Our system must do the complex computations on top of the data the data source provides.
We quickly realized that what we ultimately want to achieve is similar to table functions, a feature supported by many database systems. In short, a table function, or user-defined table function, has these qualities:
- Takes a set of parameters
- Uses code to return (generate) table data
- Can be used like a normal table
The big benefit of table functions is the simplicity of the contract between the rest of the system and the function itself: the function is called with a set of parameters and is expected to return table data.
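In Python terms, you can think of a table function roughly like this (an illustrative analogy only; the function name and data are made up):

```python
import pyarrow as pa

# An illustrative "table function": it takes a set of parameters, uses code to
# generate table data with a known schema, and can be used like a normal table.
def top_customers(country: str, limit: int) -> pa.Table:
    # Any code can run here: an API call, a NoSQL query, an ML model...
    data = [
        {"customer": "Alpha", "country": country, "revenue": 1200.0},
        {"customer": "Beta", "country": country, "revenue": 800.0},
    ]
    return pa.Table.from_pylist(data[:limit])
```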
The table function model is an ideal starting point for our 'build your own data source' story:
- The contract is simple, yet still flexible enough
- The functions return tables with a known schema, which can be easily mapped to datasets in our semantic model
- The return value is a table, which lends itself to additional, possibly complex SQL post-processing on our side
The twist in our case is that these table functions are remote. We want to allow developers to build a server with their functions, register this server with GoodData like any other data source, and then integrate the available table functions into the model. That is where Apache Arrow and the Flight RPC come into play again.
Remote Table Function Invocation
It is straightforward to map an invocation of a table function to Flight RPC: it means generating a new Flight described by a command. In this case, the command contains the name of the table function and a set of parameters. The invocation flow is then the standard Flight RPC GetFlightInfo -> DoGet sequence as described in the Flight RPC specification.
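A minimal client-side sketch of that invocation flow, using PyArrow's Flight client (the server address and the command payload shape are assumptions for illustration):

```python
import json
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:17001")  # assumed server address

# The command names the table function and carries its parameters
# (the payload shape here is illustrative).
command = json.dumps({"function": "top_customers", "parameters": {"country": "US"}})
descriptor = flight.FlightDescriptor.for_command(command.encode())

# Standard Flight RPC flow: GetFlightInfo describes the generated Flight,
# DoGet then streams its data.
info = client.get_flight_info(descriptor)
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    table = reader.read_all()  # Arrow table produced by the function
```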
Remote Table Function Discovery
Next, we had to solve how to achieve discoverability of the functions that a Flight RPC data service implements. This is essential in our context because we want to integrate these custom data sources with our existing tooling, which shows all the datasets available in a data source and allows users to drag-and-drop them into the model.
For this, we repurposed Flight RPC's existing ListFlights method. Normally, the ListFlights method lists existing Flights that are available on a data service and can be picked up using DoGet.
In our case, we slightly modified the contract so that the ListFlights method returns information about Flights that the server can generate. Each FlightInfo returned in the listing (see the sketch after the list):
- Contains a Flight descriptor that includes a command; the command payload is a template that can be used to build a payload for the actual function invocation
- Contains an Arrow schema that describes the table generated by the function, the most important piece of information needed to create datasets in our semantic model
- Does not contain any locations, simply because the FlightInfo does not describe an existing flight whose data can be directly consumed
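A quick sketch of what discovery looks like from the client side under this modified contract (again assuming PyArrow's Flight client and an illustrative server address):

```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:17001")  # assumed server address

# Each FlightInfo describes a Flight the server *can* generate rather than one
# that already exists; note that no locations are present for direct reads.
for info in client.list_flights():
    command_template = info.descriptor.command  # template for the invocation payload
    schema = info.schema                        # Arrow schema of the generated table
    print(command_template, schema)
```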
Integration with FlexQuery
At this point, all the pieces of the puzzle are on the table:
- A custom data source exposing table functions
- Discoverability of the functions
- Mapping between semantic model datasets and the functions
- A method for invoking the functions
All that remains now is 'just' putting the pieces together. This is an ideal place for me to showcase the capabilities and flexibility that FlexQuery provides (we use the 'Flex' prefix for a reason).
Data source driver
FlexQuery communicates with data sources using connectors. The connectors use physical connections created by the FlexQuery data source drivers. So, naturally, we had to create a new FlexQuery data source driver.
I do not want to go into the contract for our FlexQuery connector drivers now. Suffice it to say, the contract does not require that the data source support SQL; the drivers are free to run queries or list available flights on arbitrary data sources and to use completely custom query or listing payloads.
Since the server hosting the table functions is an Arrow Flight RPC server with standard semantics, we created a FlexQuery driver that can connect to any Flight RPC server and then:
- Run queries by making GetFlightInfo -> DoGet calls while passing an arbitrary Flight descriptor provided by the caller
- Perform ListFlights and pass arbitrary listing criteria provided by the caller
This driver is then loaded into FlexQuery's connector infrastructure, which addresses the boring concerns such as spinning up the actual data source connectors, distributing replicas of data source connectors across the cluster, and managing connection pools.
Once the driver is loaded into the FlexQuery cluster, it is possible to add new data sources whose connections are realized by the driver. FlexQuery can now connect to Flight RPC data sources.
Integrating with the semantic model and compute engine
The system's ability to physically connect to and run queries on a Flight RPC data source is not the end of the story. The next step is integrating into the semantic model, running complex computations on top of data obtained from table functions, and post-processing the intermediate results into their final form.
Integration into the semantic model is very straightforward: all we had to do was add an adapter to the FlexQuery component that is responsible for discovering semantic model datasets within a data source.
In this case, the adapter calls out to the data source's ListFlights and then performs some straightforward conversions of information from FlightInfo to the semantic model's data classes. The FlightInfo contains information about the function and the Arrow schema of the result table; these can be mapped to a semantic model dataset in an almost 1:1 fashion: a table function becomes a dataset, and the dataset's fields are derived from the Arrow schema.
Declare Datasets Queryable
To make the datasets queryable, we had to declare to our SQL query builder (built on top of Apache Calcite) that all datasets from the new data source can be queried using SQL in the DuckDB dialect.
This enabled the following (a simplified illustration follows the list):
- Simplified Query Building: When a user requests a complex report involving this data source, the SQL builder treats all table functions as existing tables within a single DuckDB database.
- Accurate Column and Table Mapping: When our SQL builder creates SQL, it also pinpoints all the tables, and the columns of those tables, that are used in the query.
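Conceptually, the generated SQL can then reference the function results as if they were plain tables sitting side by side in one DuckDB database. A minimal sketch, with made-up table names and data:

```python
import duckdb
import pyarrow as pa

# Arrow tables standing in for the results of two remote table functions.
sales = pa.table({"region": ["EU", "US"], "revenue": [100.0, 250.0]})
targets = pa.table({"region": ["EU", "US"], "target": [120.0, 200.0]})

con = duckdb.connect()
con.register("sales", sales)      # each function result appears as a table
con.register("targets", targets)

# The SQL builder can emit DuckDB-dialect SQL that joins them like any tables.
report = con.execute(
    "SELECT s.region, s.revenue / t.target AS attainment "
    "FROM sales s JOIN targets t USING (region)"
).arrow()
```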
Add an Adapter to the Execution Orchestration Logic
Since the new data source does not natively support SQL querying, we had to adjust our execution logic (a sketch of the resulting invocation payload follows the list):
- Leveraging FlexQuery's Data Service: FlexQuery already has a data service that allows SQL post-processing using DuckDB. We leveraged this by dispatching the SQL created by our builder to this existing service and correctly specifying how to obtain the data for the different "tables."
- Data Retrieval for Each Table: The data for each table is obtained by calling the FlexQuery connector for the newly added data source. The code constructs a Flight descriptor that contains all the essential remote function invocation parameters; in this case, the full context of the report execution.
- Optimizing Data Handling: As both an optimization and a convenience, FlexQuery also indicates to the remote table functions which columns are essential in the particular request. This allows the function to (optionally) trim unnecessary columns.
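Put together, the orchestration builds an invocation descriptor along these lines. This is a hedged sketch; the field names are illustrative, not the published schema (covered later in the article):

```python
import json
import pyarrow.flight as flight

# Illustrative invocation payload: the report execution context travels along,
# plus the columns that are actually needed so the function may trim the rest.
invocation = {
    "function": "sales_by_region",
    "parameters": {"report_execution_context": {"filters": [], "date_range": "2024"}},
    "columns": ["region", "revenue"],
}
descriptor = flight.FlightDescriptor.for_command(json.dumps(invocation).encode())
# The descriptor is then dispatched through the FlexQuery connector using the
# standard GetFlightInfo -> DoGet flow shown earlier.
```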
Performing Post-Processing of Intermediate Results
To further process the intermediate results, we reused the existing data service that handles reshaping and sorting of any Arrow data in our system. This step is achieved simply by piping the result of the SQL query to the post-processing data service and tailoring the Flight descriptors accordingly, providing an efficient and seamless post-processing workflow.
And that is it! Our product now supports the invocation of remote table functions and can even perform complex computations on the data they provide. This new logic hooks seamlessly into the rest of FlexQuery, so the results can be cached for repeated reads and/or passed to further post-processing using Pandas/Polars.
The following diagram outlines the architecture after we added support for remote table functions. We added new plugins and enhanced existing components with a few adapters (depicted in green); all the other components were already in place. We reused those existing components by crafting the data processing requests differently.
Note that in this picture, every time a component is about to perform a possibly complex computation that results in a write of either an intermediate or a final result, it will first check for the existence of the data in the cache; such requests are short-circuited, and the caller is instead routed to work with the data from the cache.
The best part was that we already had most of the services and architecture fundamentals ready from our earlier effort to enable analytics on top of CSV files.
FlexConnect Introduction
The result of the architecture outlined above is what we call FlexConnect:
- Developers can build custom FlexConnect functions and run them inside a Flight RPC server. The server can then be integrated with GoodData as a data source, and the functions appear as regular datasets.
- The data generated by the functions can be used in reports by any user.
- Soon™, we will also add the first level of federation capabilities, data blending, which will allow users to combine data from FlexConnect functions with data from other data sources.
A FlexConnect function is a simple contract built on top of Flight RPC that allows the data service to expose custom data generation algorithms. The parts that are specific to GoodData are the commands included in the Flight descriptors.
To help bootstrap the development of a FlexConnect server that hosts custom functions, we have expanded our GoodData Python SDK with several new packages and a template repository that can be used to quickly build production-ready Flight RPC servers exposing FlexConnect functions.
FlexConnect Template Project
A good entry point is the FlexConnect template repository. Its purpose is to quick-start the development of FlexConnect functions; developers can focus on the code of the function itself and will not need to worry about the Flight RPC server at all. The template is a Python project and uses PyArrow under the covers.
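To give a feel for the developer experience, here is a rough sketch of what a function built from the template might look like. The class shape and names are drawn from the template's general approach, but treat this as an assumption and consult the repository for the authoritative interface:

```python
import pyarrow as pa
from gooddata_flexconnect import FlexConnectFunction  # package used by the template

# A sketch of a custom FlexConnect function; the name, schema, and data are
# made up for illustration.
class SalesByRegion(FlexConnectFunction):
    Name = "SalesByRegion"
    Schema = pa.schema([("region", pa.string()), ("revenue", pa.float64())])

    def call(self, parameters, columns, headers):
        # Any custom code can run here: call an internal API, query NoSQL,
        # score an ML model... and return the table data as Arrow.
        return pa.table({"region": ["EU", "US"], "revenue": [100.0, 250.0]})
```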
We built this template according to the practices we follow when developing Python libraries and microservices ourselves. It is set up with a linter, auto-formatting, and type-checking. It also comes with a sample Dockerfile.
GoodData Flight Server
The Flight RPC server that runs and exposes the developed FlexConnect functions is built using the gooddata-flight-server package. This is our opinionated implementation of a Flight RPC server, influenced heavily by our production use of Arrow Flight RPC.
It is built on top of PyArrow's Flight RPC infrastructure and solves the boring concerns related to running Flight RPC servers in production. It also provides the basic infrastructure to support generating Flights using long-running queries.
This server is otherwise agnostic to GoodData; you can even use it for other Flight RPC data services that know nothing about FlexConnect. The server deals with the boilerplate while the developer plugs in the actual implementation of the Flight RPC methods.
FlexConnect Contracts
FlexConnect function hosting and invocation are implemented as a plugin to the GoodData Flight Server (described above). You can find the code for this plugin in our Python SDK.
Within this plugin implementation, there are also schemas that describe the payloads used during function invocation. When GoodData / FlexQuery invokes the function, it gathers all the essential context and sends it as function parameters.
These parameters are included as JSON inside the command sent in the Flight descriptor. We have also published the JSON schemas describing the parameters. You can find the schemas here.
This way, if developers wish to implement the entire Flight RPC server (perhaps in a different language) that exposes remote functions compatible with GoodData, all they have to do is make sure the implementation understands the incoming payloads as described in the schemas.
Conclusion
I hope this article was helpful and provided an example of how Apache Arrow and the Flight RPC can be used in practice, and how we at GoodData use these technologies to meet complex analytics use cases.
Speaking as an architect and engineer, I must admit that since we adopted Apache Arrow, we have never looked back. Combined with the sound and open architecture of FlexQuery, we are building our product on a very solid and well-performing foundation.
Learn More
FlexConnect is built for developers and companies looking to streamline the integration of diverse data sources into their BI workflows. It gives you the flexibility and control you need to get the job done with ease.
Explore detailed use cases like connecting APIs, running local machine learning models, handling semi-structured NoSQL data, streaming real-time data from Kafka, or integrating with Unity Catalog, each with its own step-by-step guide.
Want more details about the bigger picture? Connect with us through our Slack community for support and discussion.