You can barely spend an hour these days without reading about generative AI. While we are still in the embryonic stage of what some have dubbed the “steam engine” of the Fourth Industrial Revolution, there’s no doubt that GenAI is shaping up to transform nearly every industry — from finance and healthcare to law and beyond.
Great user-facing apps may attract the most hype, but it’s the companies powering this revolution that stand to benefit the most right now. Just this month, chipmaker Nvidia briefly became the world’s most valuable company, a $3.3 trillion juggernaut driven largely by demand for AI computing power.
But in addition to GPUs, companies also need infrastructure to manage the flow of data — to store, process, train, analyze, and ultimately unleash the full potential of AI.
One company looking to take advantage of this is Onehouse, a three-year-old California startup founded by Vinoth Chandar, who created the open source Apache Hudi project while working as a data engineer at Uber. Hudi brings the benefits of data warehouses to data lakes, creating what has become known as a “data lakehouse,” enabling support for actions such as indexing and performing real-time queries on large datasets, be they structured, unstructured, or semi-structured.
For example, an e-commerce company that constantly collects customer data, including orders, reviews, and related digital interactions, will need a system to ingest all that data and keep it up to date, which could help it recommend products based on user activity. Hudi allows data from different sources to be ingested with minimal latency, while supporting deletes, updates, and inserts (“upserts”), which is vital for real-time data use cases.
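To make that concrete, here’s a minimal sketch of what a Hudi upsert looks like through the Spark DataFrame API, assuming a Spark session with the Hudi bundle on the classpath. The table name, columns, and storage path (orders, order_id, updated_at, the s3:// URI) are hypothetical stand-ins for the e-commerce example above.

```python
# Minimal Hudi upsert sketch via PySpark; assumes the hudi-spark bundle jar
# is on the Spark classpath. Table name, columns, and path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Hudi requires Kryo serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# New and updated order records arriving from an upstream source.
updates = spark.createDataFrame(
    [("o-1001", "u-42", 79.99, "2024-06-24 10:15:00")],
    ["order_id", "user_id", "total", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    # The record key identifies a row; the precombine field decides which
    # version wins when the same key arrives more than once.
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Rows whose order_id already exists are updated in place; new keys are
# inserted. This is the "upsert" behavior described above.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/orders"))
```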
Onehouse builds on this with a fully managed data lakehouse that helps companies deploy Hudi. Or, as Chandar puts it, data “begins to be ingested and standardized into open data formats” that can be used with almost all the major tools in the data science, artificial intelligence, and machine learning ecosystems.
“Onehouse gets rid of the need to build low-level data infrastructure, helping AI companies focus on their models,” Chandar told TechCrunch.
Onehouse today announced it has raised $35 million in a Series B funding round as it brings two new products to market to improve Hudi performance and reduce cloud storage and processing costs.
Down at the (data) lake
Chandar created Hudi as an internal project within Uber in 2016, and since the ride-hailing company donated the project to the Apache Foundation in 2019, Hudi has been adopted by the likes of Amazon, Disney, and Walmart.
Chandar left Uber in 2019 and, after a short stint at Confluent, founded Onehouse. The startup came out of stealth in 2022 with $8 million in seed funding, and followed that up shortly after with a $25 million Series A round. Both rounds were led by Greylock Partners and Addition.
Those venture capital firms have joined forces once again for the Series B, though this time around David Sacks’s Craft Ventures is leading the round.
“The data lakehouse is rapidly becoming the standard architecture for organizations that want to centralize their data to power new services like real-time analytics, predictive machine learning, and GenAI,” Michael Robinson, partner at Craft Ventures, said in a statement.
In this context, data warehouses and data lakes are similar in that they both act as a central repository for collecting data. But they do so in different ways: a data warehouse is ideal for processing and querying structured, historical data, while data lakes emerged as a more flexible alternative for storing massive amounts of raw data in its native format, with support for multiple data types and high-performance querying.
This makes data lakes ideal for AI and machine learning workloads: it is cheaper to store raw, pre-transformed data, and more complex queries remain possible because the data can be kept in its original form.
However, the trade-off is a whole new set of data management complexities, which threaten to degrade data quality given the wide range of data types and formats. This is partly what Hudi sets out to solve by bringing some of the key features of data warehouses to data lakes, such as ACID transactions to support data integrity and reliability, as well as improved metadata management for more diverse datasets.
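One way to see those warehouse-style guarantees in action: every Hudi write lands as an atomic commit on the table’s timeline, so the table can be read as of a past instant. A minimal sketch, reusing the hypothetical orders table and path from the earlier example (the timestamp is hypothetical too):

```python
# Time-travel read sketch: because Hudi writes are atomic commits on a
# timeline, a table can be queried as of a past instant. The path and
# timestamp are hypothetical; session setup mirrors the earlier sketch.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

snapshot = (
    spark.read.format("hudi")
    # Read the table as it stood at this commit instant.
    .option("as.of.instant", "2024-06-24 10:15:00")
    .load("s3://example-bucket/lake/orders")
)
snapshot.show()
```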
![Configure data pipelines in Onehouse](https://techcrunch.com/wp-content/uploads/2024/06/65b0f5ffe08c0ab6e68692ea_Ingest-in-minutes-p-1080-e1719227853762.png?w=680)
Because it’s an open source project, any company can deploy Hudi. A quick glance at the logos on Onehouse’s website reveals some impressive users: AWS, Google, Tencent, Disney, Walmart, ByteDance, Uber, and Huawei, to name a few. But the fact that such big-name companies are leveraging Hudi internally speaks to the effort and resources required to build it as part of an on-premises data lake setup.
“While Hudi provides rich functionality to ingest, manage, and transform data, companies still have to integrate about a half-dozen open source tools to achieve their goals of a high-quality data lakehouse,” Chandar said.
That’s why Onehouse offers a fully managed cloud platform that ingests, transforms and optimizes data in a fraction of the time.
“Users can get open data lakes up and running in less than an hour, with broad interoperability with all major cloud-native services, warehouses, and data lake engines,” Chandar said.
The company has been coy about naming its commercial clients, aside from a couple listed in case studies, such as the Indian unicorn Apna.
“As a young company, we are not sharing the full list of Onehouse’s commercial clients publicly at this time,” Chandar said.
With a new $35 million in the bank, Onehouse is now expanding its platform with a free tool called Onehouse LakeView, which provides observability into lakehouse workloads, surfacing insights into table statistics, trends, file sizes, timeline history, and more. This builds on the observability metrics already provided by the core Hudi project, adding extra context on workloads.
“Without LakeView, users need to spend a lot of time interpreting metrics and deeply understand the entire stack to root-cause performance issues or inefficiencies in the pipeline configuration,” Chandar said. “LakeView automates this and provides email alerts on good or bad trends, flagging data management needs to improve query performance.”
Additionally, Onehouse is launching a new product called Table Optimizer, a managed cloud service that optimizes existing tables to speed up data ingestion and transformation.
“Open and interoperable”
There’s no ignoring the countless other big-name players in the space. The likes of Databricks and Snowflake are increasingly embracing the lakehouse model: earlier this month, Databricks reportedly paid $1 billion to acquire a company called Tabular, with the goal of creating a common lakehouse standard.
Onehouse has certainly entered a hot field, but it hopes its focus on an “open and interoperable” system that makes it easier to avoid vendor lock-in will help it stand the test of time. It essentially promises the ability to make a single copy of data globally accessible from almost anywhere, including native Databricks, Snowflake, Cloudera, and AWS services, without the need to create separate data silos on each.
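As a rough illustration of that “single copy” idea: a Hudi table written once to object storage can be registered and queried in place by multiple engines. The sketch below does this from Spark SQL, reusing the hypothetical table and path from the earlier sketches; the same files could also be exposed to engines such as Trino or Presto through their Hudi connectors. Exact session configs vary by Spark and Hudi version.

```python
# Query one copy of the data in place via Spark SQL; no second silo is created.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Enables Hudi's Spark SQL extensions (DDL/DML on Hudi tables).
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Register the existing Hudi table by pointing at its storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders
    USING hudi
    LOCATION 's3://example-bucket/lake/orders'
""")

# Any engine registered against the same files sees the same data.
spark.sql(
    "SELECT user_id, SUM(total) AS lifetime_value FROM orders GROUP BY user_id"
).show()
```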
As with Nvidia in the GPU space, the opportunities that await any company in the data management space cannot be ignored. Data is the cornerstone of AI development, and not having enough good data is a chief reason why so many AI projects fail. But even when data is available in large quantities, companies still need the infrastructure to ingest, transform, and unify it to make it useful. This bodes well for Onehouse and others like it.
“On the data management and processing side, I believe that high-quality data delivered through a strong data infrastructure foundation will play a critical role in turning these AI projects into real-world production use cases — avoiding garbage-in/garbage-out data problems,” Chandar said. “We are starting to see such demand from data lakehouse users, as they struggle to scale the data processing and query needs to build these newer AI applications on enterprise-scale data.”