Azure Databricks: What Is It and What Can You Do with It?
"I will talk about two sets of things. One is how productivity and collaboration are reinventing the nature of work, and how this will be very important for the global economy. And two, data. In other words, the profound impact of digital technology that stems from data and the data feedback loop." ~ Microsoft CEO Satya Nadella
Collaboration, productivity and data are what Azure Databricks is all about.
Data, of course, is everything. How many F-150 trucks did Ford build this year? What is the patient’s heart rate and blood pressure? Where can we find sushi at this time of night?
The iPhone in your back pocket is filled with data and searching for more.
How does that data get organized so you can find it when you need it?
Machines do a lot of that work. But people working in collaboration with other people and other machines make the data-driven world go around.
Azure Databricks is a collaboration between Microsoft and the creators of Apache Spark, which is described on its homepage as an "analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing."
Databricks, the company created to commercialize Spark, "provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products," the company states. "Users achieve faster time-to-value with Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus on their data by providing a fully managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership."
One-Click Setup and Management
In announcing the collaboration with Databricks in 2017, Microsoft touted Azure Databricks as "a fast, easy and collaborative Apache Spark-based analytics platform that delivers one-click setup, streamlined workflows and an interactive workspace. Native integration with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory and Power BI simplifies the creation of modern data warehouses that enable organizations to provide self-service analytics and machine learning over all data with enterprise-grade performance and governance."
In a Microsoft overview of Azure Databricks, the company explains its value-add: "Azure Databricks features optimized connectors to Azure storage platforms (e.g. Data Lake and Blob Storage) for the fastest possible data access, and one-click management directly from the Azure console. This is the first time that an Apache Spark platform provider has partnered closely with a cloud provider to optimize data analytics workloads from the ground up."
What Does All This Mean?
So here’s this extensive set of data tools. What can you build with them? In January, Databricks provided answers from data industry thought leaders, who focused on the need for solutions to the issues organizations face with AI, Big Data and Analytics.
Kamelia Aryafar, chief algorithm officer at Overstock, sees deep learning, which is a class of machine learning algorithms facilitated by the Spark technology, paying dividends for organizations. "Deep learning innovations will create a lot of new AI applications, some of which are already in production and making massive changes in the industry," she is quoted as saying. She noted that Overstock is currently using deep learning to improve marketing projects such as email campaigns.
Other thought leaders quoted by Databricks see the need for the latest data tools to be used to improve long-standing issues including data processing and providing trusted data with "Explainable AI."
Because of the social, economic and commercial implications of the data being generated, "it is critical to develop AI that is explainable, provable and transparent," said Mainak Mazumdar, chief research officer at Nielsen.
Databricks CEO and co-founder Ali Ghodsi finds that data processing is still a challenge in the AI era. "As an industry we tend to believe that data scientists are spending the majority of their time developing models," he shares. "Truth be told, data processing remains the hardest and most time consuming part of any AI initiative. The highly iterative nature of AI forces data teams to switch between data processing tools and machine learning tools. For organizations to succeed at AI in 2019, they have to leverage a platform that unifies these disparate tools."
Reading between the lines, Databricks is positioning itself as the unifying platform those organizations need.
Databricks concluded the survey on the near future of AI and Big Data, stating: “Solving the world’s toughest data problems starts with bringing all of the data teams together within an organization. Data science and engineering teams’ ability to innovate faster has historically been hindered by poor data quality, complex machine learning tool environments, and limited talent pools. Additionally, organizational separation creates friction and slows projects down, becoming an impediment to the highly iterative nature of AI projects. Much like in 2018, organizations that leverage Unified Analytics will have a competitive advantage with the ability to build data pipelines across various siloed data storage systems and to prepare labelled datasets for model building, which allows organizations to do AI on their existing data and iteratively do AI on massive data sets.”
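To make the idea of building a pipeline across siloed data stores and preparing a labelled dataset a little more concrete, here is a minimal sketch in plain Python. It is not Databricks or Spark code, and nothing in it comes from the survey: the two in-memory "silos," the field names and the labelling rule are all hypothetical, chosen only to show the shape of an extract, join and label step.

```python
import csv
import io
import json

# Hypothetical "silo" 1: order records exported as CSV text.
ORDERS_CSV = """customer_id,order_total
c1,120.50
c2,15.00
c3,310.25
"""

# Hypothetical "silo" 2: email-campaign events stored as JSON.
CAMPAIGN_JSON = (
    '[{"customer_id": "c1", "opened_email": true},'
    ' {"customer_id": "c2", "opened_email": false},'
    ' {"customer_id": "c3", "opened_email": true}]'
)


def load_orders(text):
    """Extract: parse CSV rows into {customer_id: order_total}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["customer_id"]: float(row["order_total"]) for row in reader}


def load_campaign(text):
    """Extract: parse JSON events into {customer_id: opened_email}."""
    return {e["customer_id"]: e["opened_email"] for e in json.loads(text)}


def build_labelled_dataset(orders, campaign, threshold=100.0):
    """Join the two sources on customer_id and attach a label.

    The label (1 if order_total >= threshold, else 0) is an
    illustrative rule only, not a recommendation.
    """
    rows = []
    # Keep only customers present in both sources (an inner join).
    for customer_id in sorted(orders.keys() & campaign.keys()):
        total = orders[customer_id]
        rows.append({
            "customer_id": customer_id,
            "opened_email": campaign[customer_id],
            "order_total": total,
            "label": 1 if total >= threshold else 0,
        })
    return rows


dataset = build_labelled_dataset(load_orders(ORDERS_CSV), load_campaign(CAMPAIGN_JSON))
for row in dataset:
    print(row)
```

On a platform such as Azure Databricks the same three steps would typically run as distributed Spark jobs against real storage systems rather than in-memory strings, but the structure of the work is similar.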
Practical Use Cases
Healthcare is an area where AI can be used to parse patient data to provide diagnostic and other assistance to medical professionals.
Last June, Databricks announced that it had been working with pharmaceutical and healthcare providers "to improve their drug discovery processes."
"One such customer, the Regeneron Genetics Center (a wholly-owned subsidiary of Regeneron, a leading biotechnology company), has sequenced over 300,000 consented volunteers and paired their de-identified genetic data with de-identified electronic health records to uncover actionable insights for drug discovery and development," Databricks said in the announcement.
Jeffrey Reid, PhD, Head of Genome Informatics at Regeneron, was quoted as saying: "As this dataset has grown rapidly, we encountered significant barriers in simple tasks, like gathering all of the data for a given analysis, and querying the 10s of billions of results from our studies. Not only has the Databricks Unified Analytics Platform solved these big data problems, but it is enabling everyone in our integrated drug development process – from physician-scientists to computational biologists – to easily access, analyze, and extract insights from all of our data. Drug development is still a long and difficult process rife with failure, but we have already significantly reduced the amount of time it takes to generate important early insights."
Databricks said its platform enables medical researchers to:
- Accelerate discovery with simplified genomic pipelines: Simplify workflows with prebuilt genomic pipelines hosted in the cloud to process large datasets up to 100x faster than existing solutions.
- Innovate faster with interactive, tertiary analytics and AI at scale: Quickly and simply run tertiary analytics and machine learning algorithms on massive genomic datasets with prepackaged frameworks designed to run in parallel.
- Improve productivity across data, analytics and research teams: Create a collaborative environment with shared workspaces for bioinformaticians, computational biologists and researchers to work together across the research lifecycle, saving teams precious time and resources.
Training for AI, Data, and Machine Learning
Working with AI, Big Data and Machine Learning is the future of application development. If you want to build Azure Databricks skills, Visual Studio Live! New Orleans this April offers a session on AI and Analytics with Apache Spark on Azure Databricks where you will learn:
- About the fundamentals of Apache Spark, Spark SQL and Spark MLlib
- How to use Databricks notebooks
- How to manage clusters and jobs
- How to integrate Azure Databricks with blob storage and Azure Data Lake Store (ADLS)
- How to write Python code for both analytics and machine learning
Find out more here.
Posted by Richard Seeley on 02/20/2019