3 Industry Leaders on the Future of Data Tooling

Specialized tooling has proliferated across the data engineering community. This trend is growing, and the data infrastructures of modern organizations are becoming more modular. There’s a bright future ahead for data engineering, one in which the tools and technology we depend on are increasingly designed with depth and cohesion in mind.

We reached out to three people in the data engineering field who are leading the charge toward our modular future. We wanted to know:

“What technology or tools in the data science and/or data engineering space are you most excited about right now and why?”

Drew Banin, Co-founder of Fishtown Analytics, core contributor to dbt

“The rise of the data warehouse is changing the landscape of the analytics world. At many organizations, the data warehouse unites all data practitioners: the scientists, the engineers, and all of the analysts in between. Whereas these folks historically worked with disparate datasets and used different tools, they now coalesce around the warehouse. As a result, these organizations are building analytical equity. The data cleansing and munging completed by an analyst for a specific BI report can be reused by a data scientist to, say, forecast revenue. This analytical equity means that modern organizations are doing more with their data, faster, and with fewer mistakes.

To this end, we’re building an open-source data modeling tool called dbt (data build tool). dbt harnesses the power of modern warehouses and provides an engineering-inspired development workflow for analytics. With dbt, analysts can transform their raw data using SQL, then “materialize” these transformed datasets back into their warehouse. These transformations frequently depend on each other, and dbt makes even arbitrarily complex chains of transformations straightforward to express in code. I’m tremendously excited about the growth and potential of dbt. We’re seeing analysts solve their own problems where they previously needed to open tickets with an engineer. Moreover, these analysts are using git, opening pull requests, and using integration tests to verify the correctness of their logic. If you’re interested in checking out dbt, you can do so here.”
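To make the idea of transformations that depend on each other concrete, here is a minimal sketch in plain Python. It is not dbt and does not use dbt's API (dbt models are written in SQL); the model names, the sample rows, and the `build` helper are all hypothetical. It only illustrates the general pattern: declare what each transformation depends on, then run everything in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical raw source table, standing in for data already in a warehouse.
raw_orders = [
    {"order_id": 1, "customer": "a", "amount_cents": 1250},
    {"order_id": 2, "customer": "b", "amount_cents": 400},
    {"order_id": 3, "customer": "a", "amount_cents": 750},
]

def clean_orders(inputs):
    # First transformation: convert cents to currency units.
    return [
        {"customer": row["customer"], "amount": row["amount_cents"] / 100}
        for row in inputs["raw_orders"]
    ]

def revenue_by_customer(inputs):
    # Second transformation: reuses the cleaned data instead of redoing the cleanup.
    totals = {}
    for row in inputs["clean_orders"]:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals

# Each hypothetical model declares its upstream dependencies and its transform.
MODELS = {
    "revenue_by_customer": (["clean_orders"], revenue_by_customer),
    "clean_orders": (["raw_orders"], clean_orders),
}

def build(models, sources):
    """Run every model after the models and sources it depends on."""
    results = dict(sources)
    graph = {name: set(deps) for name, (deps, _) in models.items()}
    build_order = []
    for name in TopologicalSorter(graph).static_order():
        if name in models:  # sources are already present in results
            deps, transform = models[name]
            results[name] = transform({dep: results[dep] for dep in deps})
            build_order.append(name)
    return results, build_order

warehouse, build_order = build(MODELS, {"raw_orders": raw_orders})
print(build_order)
print(warehouse["revenue_by_customer"])
```

Because the order is derived from the declared dependencies, the models can be listed in any order, and a cleanup step written once can feed any number of downstream reports or forecasts.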

Chris Merrick, VP of Engineering @ Stitch Data

“I’m most excited about the decline of a technology: Hadoop. It’s a bear to work with and it’s actually remarkably inefficient for all but the largest data workloads, so it was really only appropriate for use in the largest organizations (though many engineers, including yours truly, tried it at smaller companies). Many new technologies now exist that retain Hadoop’s key benefit - distributed file storage separated from distributed processing - but are much more accessible, like BigQuery, Redshift Spectrum, Athena, Snowflake, Presto, and Spark. These technologies allow organizations of any size to store and retain all of their data affordably and only pay the cost of processing when they actually need to use it. In fact, because some of these new technologies work with Hadoop components like the Hadoop Distributed File System (HDFS), many parts of the Hadoop ecosystem have bright futures, even if other parts like Hadoop MapReduce do not.”

James Campbell, Data Scientist/Researcher @ Laboratory for Analytic Sciences; core contributor to Great Expectations

“I’m most excited about tools that help manage and compose sets of models at a very practical level so people can more rapidly iterate on each other’s work. From improving the plumbing that gets data to the right places (yay for awesome message queues and democratized distributed compute), to expressive ways of writing semantically rich tests for a system (Great Expectations!), to facilitating feedback and continuous retraining of models (the commingling of continuous integration and containerization practices into data science workflows), I see a lot of new developments and opportunities for making it easy to deal with complexity. That lets us build really expressive and powerful systems quickly.

Of course, more complex systems introduce a lot of risk as well. It’s like why my dad didn’t want power windows in our car: one more thing to break. I think that’s the exciting (and challenging) opportunity–finding ways to both have the complexity we want and also be able to maintain values we need–like being able to explain our actions and being confident in our reasons. I hope that as those tools become more accessible, we’ll be able to engage more people in thinking about the critical questions of what data we need to build and which questions are truly important. And from a practical perspective, my day is less likely to be ruined when I have tools that make it easy to check my work in good ways. That’s why I’ve helped create Great Expectations: it aims at a sweet spot of providing a really practical way of expressing what we expect to be true in a data system, so we can remain confident in the integrity of a complicated analytic process.”
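As a rough illustration of what "expressing what we expect to be true in a data system" can look like, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations library or its API; the function names, the result format, and the sample rows are hypothetical stand-ins for the general idea of declarative data checks.

```python
def expect_values_not_null(rows, column):
    """Hypothetical check: every row should have a non-null value in `column`."""
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": "not_null", "column": column,
            "success": not failing, "failing_rows": failing}

def expect_values_between(rows, column, low, high):
    """Hypothetical check: non-null values in `column` should fall in [low, high]."""
    failing = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": "between", "column": column,
            "success": not failing, "failing_rows": failing}

# Made-up sample data containing one missing age and one implausible age.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 212},
]

results = [
    expect_values_not_null(rows, "user_id"),
    expect_values_not_null(rows, "age"),
    expect_values_between(rows, "age", 0, 120),
]
failed = [result for result in results if not result["success"]]

for result in failed:
    print(result["check"], result["column"], "failed at rows", result["failing_rows"])
```

Checks like these return a structured result rather than raising on the first problem, so a pipeline can report every violated expectation at once and decide for itself whether to stop.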

As for me, your Data Science Evangelist here at Mode, I see many positive implications for the industry in the growth of the tools above. The growth of dbt means that tried-and-true software engineering principles such as version control, modularity, peer review (and more) are being adopted by analysts and data scientists. The growth of data storage services mentioned by Chris above is a boon for data engineers everywhere. And it’ll be fascinating to see what becomes of Great Expectations.

In addition to those three, I’m also looking forward to the next generation of the pandas library, now commonly referred to as “pandas2”. Jeff Reback gave an excellent talk at PyData NY recently laying out the roadmap for pandas2, which looks to solve many of the common complaints around missing datatypes, black-box UDFs, and subpar performance with medium- to large-scale datasets. Many of these improvements rely heavily on the development of Apache Arrow, which aims to achieve a “common data layer” by allowing different data systems to use the same memory format, thus cutting down on computationally expensive serialization and deserialization tasks between systems. While pandas2 is still early in development, it’s exciting to see such a bright future ahead for the beloved and rapidly growing pandas ecosystem.
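The difference between serializing data and sharing one memory format can be sketched with nothing but the Python standard library. This is an analogy only: it does not use Apache Arrow or pandas, and the variable names are made up. A `pickle` round trip stands in for handing data to another system by serializing it, while a `memoryview` stands in for two consumers reading the same in-memory buffer.

```python
import pickle
from array import array

# A contiguous buffer of 64-bit floats, standing in for one column of data.
values = array("d", [1.5, 2.5, 3.0])

# Path 1: serialize, then deserialize. The receiver gets an independent copy,
# and both steps do work proportional to the size of the data.
copied = pickle.loads(pickle.dumps(values))

# Path 2: share the buffer. The view reads the same memory, so nothing is
# copied or re-encoded.
shared = memoryview(values)

# Updating the original is visible through the shared view but not in the copy.
values[0] = 10.0
print(shared[0], copied[0])
```

The shared view sees the update because it reads the same bytes, which is the kind of saving a common memory format makes possible between systems that would otherwise convert data back and forth.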

As I said before, the future is bright for the data engineering community. Modular tooling will continue to proliferate, and data engineers will increasingly benefit from best practices developed in software engineering. We wouldn’t have it any other way.