
Dr. Stavros Papadopoulos, Founder and CEO, TileDB – Interview Series

TileDB is the modern database that integrates all data modalities, code, and compute in a single product. TileDB was spun out of MIT and Intel Labs in May 2017.

Prior to founding TileDB, Inc. in February 2017, Dr. Stavros Papadopoulos was a Senior Research Scientist at the Intel Parallel Computing Lab and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for three years. He also spent about two years as a Visiting Assistant Professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD in Computer Science from HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoctoral fellowship at the Chinese University of Hong Kong with Prof. Yufei Tao.

You were previously a Senior Research Scientist at the Intel Parallel Computing Lab and a member of the Intel Science and Technology Center (ISTC) for Big Data at MIT CSAIL for three years. Can you share with us some key highlights from this period in your life?

During my time at Intel Labs and MIT, I had the unique opportunity to collaborate with luminaries in two different scientific sectors: high-performance computing (at Intel) and databases (at MIT). The knowledge and expertise I acquired became key in shaping my vision to create a new type of database system, which I eventually built as a research project within the ISTC and spun out into what became TileDB.

Can you explain the vision behind TileDB and how it aims to revolutionize the modern database landscape?

Over the last few years, there’s been a huge uptake of machine learning and generative AI applications that help organizations make better decisions. Every day, organizations are discovering new patterns in their data, and then using this information to achieve a competitive edge. These patterns emerge from an ever-growing spectrum of data modalities that must be housed and managed in order to be harnessed. From traditional tabular data to more complex data sources such as social posts, email, images, video, and sensor data, the ability to derive meaning from data requires analysis in aggregate. As data types multiply, this task becomes much more arduous, demanding a new type of database. This is exactly why TileDB was created.

Why is it crucial for organizations to prioritize their data infrastructure before developing advanced analytics and machine learning capabilities?

Amid the fervor to adopt AI is a critical and often overlooked truth – the success of any AI initiative is intrinsically tied to the quality and performance of the underlying data infrastructure.

The problem is that complex data that is not naturally represented as tables is considered “unstructured,” and is typically either stored as flat files in bespoke data formats or managed by disparate, purpose-built databases. Data scientists end up spending huge amounts of time wrangling data in order to consolidate it. It’s estimated that 80-90 percent of data scientists’ time is spent cleaning their data and preparing it for merging. That slows the time to train AI models and achieve predictive capabilities. It also means that only 10-20 percent of data scientists’ time is spent creating insights.

What are the common pitfalls organizations face when they focus more on AI and ML applications at the expense of a robust database infrastructure?

Organizations tend to focus on shiny new things – Large Language Models, vector databases, and generative AI apps built on top of a data infrastructure are current examples – at the expense of addressing the underlying data infrastructure that is crucial to analytical success. Simply put, if your organization does this, you may spend an inordinate amount of time cobbling together your data infrastructure and delay, or altogether miss, opportunities to glean insights.

Could you elaborate on what makes a database ‘adaptive’ and why this adaptability is essential for modern data analytics?

An adaptive database is one that can shape-shift to accommodate all data – regardless of its modality – and store it together in a unified manner. An adaptive database brings structure to data that is otherwise considered “unstructured.” It’s estimated that 80 percent or more of the world’s data is non-tabular, or unstructured, and most AI/ML models (including LLMs) are trained on this type of data.

TileDB structures data in multi-dimensional arrays. How does this format improve performance and cost-efficiency compared to traditional databases?

The foundational strength of a multidimensional array database is that it can morph to accommodate practically any data modality and application. A vector, for instance, is simply a one-dimensional array. By bringing structure to this “unstructured” data, you can consolidate your data infrastructure, significantly reduce costs, eliminate silos, increase productivity, and enhance security. Going a step further, when compute infrastructure is coupled with the data management infrastructure, you can extract instant value from your data.
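To make the vector example concrete, here is a minimal sketch of how a vector can be modeled as a one-dimensional dense array with the open-source TileDB-Py library. The array URI, sizes, and attribute name are illustrative assumptions rather than details from the interview.

import numpy as np
import tiledb

# A "vector" is just a one-dimensional array: define a dense 1-D schema
# with 1,000 float32 cells, split into tiles of 100 cells each.
dim = tiledb.Dim(name="i", domain=(0, 999), tile=100, dtype=np.uint64)
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(dim),
    attrs=[tiledb.Attr(name="v", dtype=np.float32)],
    sparse=False,
)

uri = "my_vector"  # illustrative local path; this could equally be an object-store URI
tiledb.Array.create(uri, schema)

# Write some embedding-like values, then slice a range back out.
with tiledb.open(uri, mode="w") as arr:
    arr[:] = np.random.rand(1000).astype(np.float32)

with tiledb.open(uri, mode="r") as arr:
    first_ten = arr[0:10]["v"]  # results are keyed by attribute name

The same schema machinery extends to sparse and higher-dimensional arrays, which is broadly how other modalities such as images or point clouds can be modeled.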

What are some notable use cases where TileDB has significantly improved data management and analytics performance?

The first TileDB use case was the storage, management and analysis of vast genomic data, which is very difficult and expensive to model and store in a traditional, tabular database. We observed phenomenal performance gains (on the order of 100x faster in many cases than other databases and bespoke solutions). However, our multidimensional array model is universal and can efficiently capture other data modalities, too. For example, TileDB is excellent at handling biomedical imaging, satellite imaging, single-cell transcriptomics, and point cloud data like LiDAR and SONAR.

TileDB offers open-source tools for interoperability. How does an open source approach benefit the scientific and data science communities?

We are big proponents of open source at TileDB. The core library and data format specification are both open source. In addition, our life sciences offerings, built on top of the core array library, are also open source. This includes TileDB-SOMA, a package for efficient and scalable single-cell data management, which was built in collaboration with the Chan Zuckerberg Initiative and powers the CELLxGENE Discover Census, the world’s largest fully curated single-cell dataset. This too is open source and is used by academic institutions and major pharmaceutical companies across the globe.
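As an aside from the interview, here is a minimal sketch of what querying the Census through the open-source cellxgene_census Python package (built on TileDB-SOMA) can look like; the column names and tissue filter are illustrative assumptions.

import cellxgene_census

# Open the latest release of the CELLxGENE Discover Census, which is
# backed by TileDB-SOMA arrays hosted in the cloud.
with cellxgene_census.open_soma() as census:
    human = census["census_data"]["homo_sapiens"]
    # Read a small slice of the cell metadata ("obs") into a pandas DataFrame.
    obs = human.obs.read(
        column_names=["assay", "cell_type", "tissue_general"],
        value_filter="tissue_general == 'tongue'",
    ).concat().to_pandas()
    print(obs.head())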

What do you see as the future trends in data management?

As data becomes richer, AI applications become smarter. Large Language Models are becoming more and more powerful, leveraging multiple data modalities, and the integration of these LLMs with diverse data sets is opening up a new frontier in AI known as multimodal AI.

Practically speaking, multimodal AI means that users are not limited to one input and one output type, and can prompt a model with virtually any input to generate virtually any content type. We see TileDB as the ideal database for multimodal AI, built to support any new and different data types that may emerge.

Thank you for the great interview. Readers who wish to learn more should visit TileDB.

