Simple Data Stack

Shawn Cao
6 min read · Nov 20, 2021


TL;DR

We’re living in an age of huge market demand for turning data into insights. The modern data stack is great at solving a variety of technical problems and has helped many companies build data solutions, but we need a simple data stack to serve even more businesses with similar needs. A simple data stack simplifies users’ access to data, including big data and real-time data, by abstracting the complexity away.

Context

You may have heard these claims, or at least one of them: “Data is exploding”, “All business will be data-driven”, and “The data age is here”. They are true: over 2.5 quintillion bytes of data are generated every day, and according to this market research, the data analytics market alone will reach $105B by 2027. Many businesses have the need but no IT department, or only a little engineering capacity; they would rather focus their limited time on the core business than on the technology itself.

In the three sections of this article, we will analyze the current state of both the market and the technology, and discuss why we need a simple data stack now:

  1. The market of businesses that have limited or no IT/technology department.
  2. A glance at the modern data stack beyond the traditional database.
  3. The concept of a simple data stack.
The big data era is not over, it has just begun.

Old Problems, Developed Technologies, and New Markets

Big data problems have been tackled for over a decade. Internet giants like Google and Facebook led the way initially, but the industry picked it up quickly, and many businesses have benefited from the prosperity of open source software. These names are historical records witnessing the maturing of big data technology: MapReduce, Hadoop, Hive, Spark, Presto… and that is just the computing perspective. “Big data problems” are not just distributed computing problems; they span a wide spectrum when viewed as a tech stack:

  • Storage: HDFS, Cloud Storage (S3, GCS, Azure Blob), distributed memory stores, KV stores (RocksDB), etc.
  • Data warehouses: Snowflake, Firebolt, Delta Lake, Apache Iceberg, Starburst Data, etc.
  • Analytics DB: Druid, Pinot, Clickhouse, Nebula, etc.
  • Streaming: Apache Kafka (and Redpanda by Vectorized), Apache Pulsar, Apache Flink, Spark Streaming, etc.
  • ETL / Ingestion: Many different solutions, such as Apache Airflow, DBT, etc.
  • Data Science: All types of notebooks (Jupyter, Kaggle, Observable), mostly dominated by Python libraries such as NumPy and Pandas.
  • Analytics Products: Tableau, Power BI, etc.
  • ML/AI: Pytorch, Tensorflow, etc.

(Disclaimer: these names are just examples within the writer’s knowledge to help picture the stack layers; they do not define market positions.)

In Chinese, there is a saying: “A Hundred Flowers Blossom Simultaneously (百花齐放)”. It means many great things appear and shine at the same time, which is indeed the case in this era.

But have you noticed? We have an “accessibility” problem. We say “data is available for every individual business/person”, but the majority of them cannot deploy and operate these technologies themselves because of the complexity.

The reasons for this accessibility problem are:

  1. Limited engineering capacity. Most big data technologies are beyond the engineering capacity of most companies. 99% of US companies have fewer than 50 employees, and most of them have fewer than 5–10 software engineers or IT professionals.
  2. Dev/Ops cost is high. Every single technology has lots of internals to understand before you can operate and maintain it well. Many engineers spend time every day on configuration tuning, complex APIs, and secondary development.
  3. Multi-point touches. To build an end-to-end scenario, operating on a single technology is not enough. For example, to build a simple real-time analytics app, most of the time you need to set up real-time streaming storage, ETL, analytics DB, and analytics product (UI)/notebook.

If you believe that even a small percentage of small and medium-sized companies want to turn more available data into a competitive strength, there is a huge market fit in making big data technology accessible, and here is a related post on it.

Technology is pretty developed, but it doesn't serve the new rising markets well due to accessibility problems.

Modern Data Stack

I’m guessing this is not the first time you have heard the term “modern data stack”. It’s appealing because everybody wants to keep up with the modern pace. However, I don’t believe it is an officially defined term, and different people may understand it differently. Here are a few references you can take a look at:

  • A company called “modern data stack”, interesting to browse
  • A post by Preset founder (the creator behind Apache Superset): post
  • A post by Rocket chief product: post

Search the term “modern data stack” on LinkedIn and you’ll see a lot more. :)

Here is a simplified illustration — some common parts based on my limited knowledge:

  1. Collect. A logging facility sends data somewhere; the modern data stack usually implies streaming storage as the destination rather than a traditional database, and sometimes you can log data directly into a modern data warehouse if no data cleaning is needed.
  2. Transform. An ETL / data pipeline that “reads, transforms, and writes” to process all collected messages. It could be as simple as scripts performing row-level operations like extraction, filtering, pruning, and populating, or as complex as stream-processing logic executed in Flink.
  3. Index. Indexing is a generic step describing the work to make data ready for querying, analyzing, and reprocessing. Here you may also see ML data pipelines that process signals (features) for training or inference in your ML algorithms.
  4. Application. Here you may see data explorer tools, analytics and visualization, and applications built for specific scenarios. Finally, you get what you came for. What a journey!
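To make the four stages concrete, here is a minimal Python sketch that walks a few events through collect, transform, index, and application. All names and the in-memory stand-ins are illustrative only; they do not represent any specific product in the stack:

```python
# Toy walk-through of the four stages with in-memory stand-ins.
from collections import defaultdict

# 1. Collect: a logging facility appends raw events to a stream.
stream = []

def collect(event):
    stream.append(event)

# 2. Transform: read, clean, and write each message (row-level ETL).
def transform(event):
    if "user" not in event:          # filtering: drop malformed rows
        return None
    return {"user": event["user"],   # extraction + pruning
            "action": event.get("action", "unknown")}

# 3. Index: prepare the data for querying, keyed by a dimension.
index = defaultdict(list)

def build_index():
    for raw in stream:
        row = transform(raw)
        if row is not None:
            index[row["user"]].append(row)

# 4. Application: a simple analytics query over the indexed data.
def actions_per_user():
    return {user: len(rows) for user, rows in index.items()}

collect({"user": "alice", "action": "click"})
collect({"user": "alice", "action": "view"})
collect({"bad": "row"})                      # dropped by Transform
collect({"user": "bob", "action": "click"})
build_index()
print(actions_per_user())  # {'alice': 2, 'bob': 1}
```

In a real deployment each stage is a separate system (streaming storage, a pipeline runner, an analytics DB, a BI tool), which is exactly where the multi-point-touch cost comes from.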

Today, innumerable projects and startups are tackling each stage to further enrich the modern data stack. It’s hard to summarize them in a single sentence, but if I may, many of them are optimizing in these areas:

  • System performance: provide alternatives but faster.
  • Business cost: provide alternatives but cheaper.
  • Technology accessibility: provide alternatives but easier to use.
The modern data stack is complex and still too difficult to apply to many businesses.

Simple Data Stack

The modern data stack presents a huge number of options to us. It solves part of the accessibility issue by embracing the cloud and the SaaS model, but overall it is still complex. There are tons of customers who want to ride this train and transform their “available data” into “competitive strength”, but the modern data stack is not affordable for them.

While we’re deepening every individual layer to remove friction for users adopting that specific technology, users still face a variety of problems: they still need to architect how to glue the different pieces together. To become a data-driven business, accessibility is still a big problem awaiting further simplification.

A simple data stack further reduces this friction. Snowflake is a good example: many companies need only one offering to cover most of their data needs. A simple data stack does more behind the scenes to simplify users’ access to big data.

Ideally, a simple data stack provides a unified API set for users to send, process, and query data. It works like a magic box. Excluding the ingress and egress APIs, the majority of the other APIs are optional hooks, depending on business needs:

  • connect: <sources>
  • send: list<messages>
  • query: SQL, GraphQL, or similar equivalents.
  • (Optional) extraction/transform/sink/subscribe/etc.: executed as asynchronous lambda functions.
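As a thought experiment, client code against such a unified API could look like the sketch below. The `SimpleDataStack` class and its method names are invented for illustration, and an in-memory SQLite database stands in for the whole backend purely to make the sketch runnable; a real implementation would hide streaming storage, ETL, and an analytics DB behind the same surface:

```python
# Hypothetical facade over the unified API: connect / send / query,
# plus an optional ingest hook. SQLite is only a stand-in backend.
import sqlite3

class SimpleDataStack:
    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._hooks = []   # optional transform hooks applied on ingest
        self._source = None

    def connect(self, source):
        # connect: <sources> - declare a source; here, one table per name.
        self._db.execute(
            f"CREATE TABLE IF NOT EXISTS {source} (user TEXT, action TEXT)")
        self._source = source

    def on_ingest(self, fn):
        # (Optional) hook: runs on every message before storage,
        # playing the role of an async lambda transform.
        self._hooks.append(fn)

    def send(self, messages):
        # send: list<messages> - ingest rows through the optional hooks.
        for msg in messages:
            for hook in self._hooks:
                msg = hook(msg)
            self._db.execute(f"INSERT INTO {self._source} VALUES (?, ?)",
                             (msg["user"], msg["action"]))

    def query(self, sql):
        # query: SQL in, rows out; GraphQL could sit on top of the same data.
        return self._db.execute(sql).fetchall()

stack = SimpleDataStack()
stack.connect("events")
stack.on_ingest(lambda m: {**m, "action": m["action"].lower()})
stack.send([{"user": "alice", "action": "CLICK"},
            {"user": "bob", "action": "View"}])
print(stack.query("SELECT user, action FROM events ORDER BY user"))
# [('alice', 'click'), ('bob', 'view')]
```

The point of the facade is that the user only ever touches these few calls, while everything between `send` and `query` stays inside the box.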

A simple data stack abstracts away the complexity of operating individual components, providing a functional default stack for scenarios that need only minimal development. At the same time, the optional interfaces make it flexible enough to address differences: users can own part of the data stack and seamlessly connect it to the simple data stack. For example, users can bring their own data collection and connect it for query/visualization/story/app building.

You may be wondering: doesn’t this sound like a traditional RDBMS? Exactly. The traditional RDBMS has been proven to be the best way to build applications, and from a user’s perspective a simple data stack should work like a managed RDBMS with unlimited scale. It addresses the concerns of:

  • Real-time nature
  • Big data scale
  • Cost efficiency
  • Pluggable complex business logic
  • Operations free

These are all that we’re looking for from a Simple Data Stack.

A Simple Data Stack is the future of a Modern Data Stack.

Thanks

Thank you so much for making it to the end! I just wanted to share some observations and thoughts on a domain I’m sure many of us care about. If you liked this, please consider following me and staying connected!


Written by Shawn Cao

Founder of Columns AI, Fina Money. I write about startups, entrepreneurship, technology, management, finance, data, politics, people…