Introducing Nebula

Meet the extremely fast, horizontally scalable, real-time analytics engine with ZERO dependencies.

Shawn Cao
4 min read · Apr 23, 2020
Nebula real-time data analytics (Screenshot by Author)

I have long been fascinated by fast big-data analytics: it lets every one of us explore massive data interactively, which is essential for modern companies and organizations that operate data-driven.

My dream analytics platform should be fast, scalable, reliable, real-time, and user-friendly. Designing a system that satisfies all of these at once is a big challenge; we have to balance these properties carefully in a single system to delight users.

Highlights

About a year ago, I started a new project on GitHub called `Nebula`. I named it `Nebula` to suggest that it will handle unbounded big-data analytics tasks. That is certainly a big goal, but it started with some small, fundamental building blocks:

  1. A columnar store with an open memory layout, schema support, and per-column variable encodings.
  2. A query engine with a pluggable UDF/UDAF interface and a full spectrum of built-in arithmetic and logical operations.
  3. An API library exposed through a full DSL, which could easily back a SQL layer as the user interface (see the sketch after this list).
  4. A distributed system that manages a horizontally scalable cluster and provides cluster intelligence to route each compute query efficiently.
  5. A web UI and REST endpoints that give users simple, friendly access: Nebula turns your data into a RESTful API, and the web UI lets users do most of their analytics with a no-code experience.
  6. A package with almost ZERO dependencies, so users can deploy and run Nebula with ease.
  7. A pluggable architecture that lets users add new data sources (S3, GCS, Kafka, …), new environments (AWS, GCP, Azure, …), and new data formats.
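
To make the DSL idea in item 3 concrete, here is a minimal sketch of what a fluent query could look like and how it could back SQL. The class and method names are illustrative assumptions, not Nebula's actual client API:

```python
# Illustrative sketch of a fluent analytics DSL that maps naturally to SQL.
# Class and method names are assumptions, not Nebula's actual client API.
class Query:
    def __init__(self, table):
        self.table, self.filters, self.keys, self.aggs = table, [], [], []

    def where(self, predicate):       # e.g. "flag == true"
        self.filters.append(predicate)
        return self

    def group_by(self, *columns):     # dimensions to group on
        self.keys.extend(columns)
        return self

    def agg(self, *exprs):            # e.g. "count(1)", "p90(value)"
        self.aggs.extend(exprs)
        return self

    def to_sql(self):                 # the same query tree could feed a SQL frontend
        cols = ", ".join(self.keys + self.aggs)
        where = " AND ".join(self.filters) or "1=1"
        return (f"SELECT {cols} FROM {self.table} WHERE {where} "
                f"GROUP BY {', '.join(self.keys)}")

# Count events per flag, the same shape as example E1 later in this post.
print(Query("nebula.test").where("_time_ > 0").group_by("flag")
      .agg("count(1)").to_sql())
```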

Even though Nebula hasn't had full-time attention, after 15 months of crafting I'm pleased to write this introduction and invite you to take a look. So far, Nebula can:

  • Closely embrace the Hadoop ecosystem: Nebula can ingest data in Parquet, CSV, JSON, and Thrift formats, and even from online spreadsheets.
  • Support AWS S3 and the local file system for static data analysis.
  • Support connecting to Kafka for real-time data analysis.
  • Support D3-based visualizations: timelines, charts, etc.
  • Support column-level access control (you need to configure your own auth system).
  • Support sparse storage for extremely fast analysis with rich filtering on multi-dimensional metrics data.
  • Support on-demand data loading through its REST API, so you can load and unload data for ephemeral analysis scenarios (see the sketch after this list).
  • Support a large number of built-in functions: count, min/max, sum, average, percentiles, etc.
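
As a hedged illustration of the on-demand loading flow, the sketch below loads an ephemeral table over REST and later unloads it. The endpoint paths and payload fields are assumptions made for illustration, not Nebula's documented interface; check the repo for the real one:

```python
# Hypothetical sketch of on-demand load/unload through a REST API.
# Endpoint paths and payload fields are assumptions, not Nebula's documented
# interface; see https://github.com/varchar-io/nebula for the real one.
import json
import urllib.request

BASE = "http://localhost:8088"  # the port used in the quick start below

def post(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# Load a CSV file from S3 as an ephemeral table with a one-hour lifetime...
post("/api/load", {"table": "ephemeral.events",
                   "source": "s3://my-bucket/events.csv",
                   "format": "csv",
                   "ttl_seconds": 3600})
# ...query it through the web UI or API, then unload it when done.
post("/api/unload", {"table": "ephemeral.events"})
```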

First Impression

As an example illustration, we can fit Nebula in a normal big data environment like this — green part belongs to Nebula.

Bird's-eye architecture (Diagram by Author)

As the picture shows, besides its compute capability to transform your data, Nebula can serve as a middle tier between your compute cluster and your storage cluster (or cloud storage): a performant data cache and data gateway that provides tabular data access, column-level access control, and data streaming.
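
As a hedged sketch of that gateway idea, a downstream job could fetch a filtered, column-pruned slice of a table over HTTP instead of scanning raw files in storage. The endpoint and query fields below are assumptions for illustration, not Nebula's actual API:

```python
# Hypothetical sketch: reading a column-pruned, filtered slice through the
# gateway tier instead of scanning raw storage. Endpoint and query fields
# are illustrative assumptions.
import json
import urllib.parse
import urllib.request

query = {
    "table": "nebula.test",
    "columns": ["tag", "value"],   # fetch only the columns this job needs
    "filter": "flag == true",      # predicate evaluated inside Nebula
}
url = ("http://localhost:8088/api/query?" +
       urllib.parse.urlencode({"q": json.dumps(query)}))
rows = json.loads(urllib.request.urlopen(url).read())
```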

So far I have painted a simple picture of what Nebula is and what it can do, but finishing the whole picture will take a lot more work. Nebula today has a good foundation and architectural skeleton, and that solid base will let it grow more and more useful in the days ahead. If you're interested, please check it out and consider contributing; I appreciate any comments, studies, questions, and contributions.

Project Information

The Nebula project is hosted here: https://github.com/varchar-io/nebula

I have also started writing about Nebula internals in posts shared here: https://nebula.bz

Trying Nebula out takes just 3 steps (on macOS or Linux/Ubuntu):
1. git clone https://github.com/varchar-io/nebula
2. cd nebula && ./run.sh (you need yarn installed)
3. open http://localhost:8088 in your browser.
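
Once step 3 succeeds, a quick way to confirm the server is serving (assuming the default port above) is a one-line check like this, which should print 200:

```python
# Ping the local Nebula web server started by run.sh (port from step 3).
import urllib.request
print(urllib.request.urlopen("http://localhost:8088").status)  # expect 200
```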

If you want to run it from source, follow the dev instructions in the repo to build Nebula.

For a single-node local run, Ubuntu 18.04 with GCC 9/10 is preferred. In addition, Nebula relies on these awesome building blocks as its foundation:

  • gRPC — client-server communication, and communication between nodes in the cluster.
  • folly — Futures for parallel processing, plus other data structures (sketched below).
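
To give a feel for how futures power the distributed query path, here is a minimal scatter-gather sketch, with Python's concurrent.futures standing in for folly's Futures; the node list and per-node results are made up for illustration:

```python
# Minimal scatter-gather sketch: fan a query out to all nodes, then merge
# the partial aggregates. concurrent.futures stands in for folly's Futures;
# node names and per-node data are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

NODES = ["node-1", "node-2", "node-3"]  # hypothetical cluster members

def query_node(node, predicate):
    # In the real system this would be a gRPC call to the node; here we
    # fake a partial aggregate (event count per flag) from its data blocks.
    fake = {"node-1": {"a": 3, "b": 1},
            "node-2": {"a": 2},
            "node-3": {"b": 5}}
    return fake[node]

def scatter_gather(predicate):
    merged = {}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_node, n, predicate) for n in NODES]
        for f in futures:  # gather each partial result and merge the counts
            for flag, count in f.result().items():
                merged[flag] = merged.get(flag, 0) + count
    return merged

print(scatter_gather("flag is not null"))  # {'a': 5, 'b': 6}
```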

Thanks to all these open-source components, we are able to bring Nebula up; a salute to all their contributors.

Once Nebula is up, you can play with a test data source through its web UI. Let's look at a few examples on this single-box Nebula.

E1: Event count per flag, over a timeline:

timeline-bar for event count per flag (Screenshot by Author)

E2: P75 and P90 of the value column per tag, over the whole lifetime (sketched in code after the screenshot):

lifetime p75/p90 of value per tag (Screenshot by Author)
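
For readers who prefer code to screenshots, here is roughly what E2 computes, sketched client-side in plain Python on made-up rows; the real percentile aggregation runs inside Nebula's engine:

```python
# Rough client-side sketch of E2's aggregation (p75/p90 of value per tag)
# on made-up rows; the real computation happens inside Nebula's engine.
from collections import defaultdict

rows = [("a", 10), ("a", 40), ("a", 70), ("b", 5), ("b", 95)]  # (tag, value)

def percentile(values, p):
    # Simple nearest-rank percentile over a sorted copy.
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

groups = defaultdict(list)
for tag, value in rows:
    groups[tag].append(value)

for tag, values in sorted(groups.items()):
    print(tag, percentile(values, 75), percentile(values, 90))
```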

E3: Any visualization type you like:

doughnut chart for avg value aggregated by both tag and flag (Screenshot by Author)

Thanks for reading this far! I hope this quick and simple start draws some of your interest in this actively developed project.

If you have any feedback or questions, don't hesitate to open an issue in the public repo (https://github.com/varchar-io/nebula), or send me your notes on this article.

Have a good one!
