A few months ago, Microsoft revealed that they were looking into adding a capability of querying Cosmos Db data through Spark and this immediately got me thinking into the new scenarios this would enable. The most ambitious is the capability of doing Global Analytics, being able to do analytical queries over your live operational data coming from anywhere on the planet. All without having to handle one piece of infrastructure yourself. The crazy thing is that this scenario is not only possible now (in public preview) but it has been dramatically simplified by Microsoft with the integration of Cosmos Db with Synapse Analytics!
Let’s talk a bit about Synapse first. What started as Azure SQL Data Warehouse, it has now evolved into an all-encompassing analytics development service called Synapse Analytics. Microsoft took the power of the MPP database engine of SQL Data Warehouse and added more powerful capabilities like on-demand SQL querying, full Spark integration, Data Lake Store Gen2 tight support as well as an integrated authoring experience. This will be the analytical piece of the Global Analytics solution.
The next piece of the Global Analytics solution is Cosmos Db. In case you are not familiar, Cosmos Db is Azure’s NoSQL Database as a Service, born in the cloud and designed to take advantage of the flexibility, elasticity, and global reach of cloud computing.
One of the main differentiating features of Cosmos Db is the built-in replication capabilities that were part of the core design of the product. Cosmos Db not only makes it easy to replicate data across the globe but offers multi-master capabilities on sites separated by thousands of kilometers. Coupled with several options for dealing with conflicts as well as consistency guarantees, it is the easiest to use, most reliable, and fastest database system that can create a truly global database. This is the operational data piece and the component that truly enabled the “global” aspect of the solution.
There are two other pieces to complete the puzzle. First is a new Cosmos Db capability Microsoft has released called the “Cosmos Db Analytical Store”. This feature takes the data coming live from Cosmos Db and stores it in Parquet format which is compressed, columnar, and optimized for analytics. This is a simple change that only requires enabling the option (for now just for new containers but this will likely change once the preview phase has completed). After this is enabled, the process and the storage are all automatically managed by Azure and you can simply focus on building your business solution.
The final piece is a new Synapse capability called Synapse Link. Cosmos Db is the first data store to support Synapse Link but it is planned to have other stores like Azure SQL Db or Azure PostgreSQL/MySQL to support it as well. Synapse Link allows transparent querying of an external database from a Synapse workspace without the need to code any sort of ETL. In the case of Cosmos Db, because it uses the “Analytical Store” it does not impact the performance of the operational Cosmos DB databases at all, the compute is all spent on the Synapse side. Implementation details for other data stores like Azure SQL Db are not known at this time but it would be great if Microsoft stays with this pattern of not impacting the operational stores resources.
So now with all the pieces, let’s think about Global Analytics. I can deploy a Cosmos Db database in West US, another one in Germany, and another one in Japan. I can set them to be Multi-Master so my local sites in those three geographies will have the lowest latency for their local data as well as transparent access to the data from the other sites. Then, I create a Synapse Analytics workspace, say in West US (if that’s where HQ might be located) and the Cosmos DB in West US has the Analytical Store activated.
Now, this is where the magic happens. Because I have Multi-Master enabled between West US, Germany, and Japan this means that the Analytical Store in my West US copy will contain the data from all 3 global sites! And then my Synapse workspace simply connects to this Analytical Store and can query this data transparently without any dependency or worry of where in the world the data actually originated! And this was all achieved through service configuration, no actual coding.
Here is a screenshot of how this can be architected in Azure:
Cosmos Db databases here are used for operational data from sites around the globe and Spark on Synapse is used for analytics in West US. Note that I only put one copy of Synapse to illustrate the point but there is no technical limitation here. If I wanted to I could add Synapse in all 3 sites and provide the lowest latency experience where all my users are. Cost is the limiting factor here. And once the data is available as a Spark table then you can use whatever you want to query it or visualize it. Power BI, Tableau, Looker, etc.
As you can see, all of this is Platform as a Service, no piece of actual infrastructure has to be managed by the client. This not only means the barrier of entry for this type of solution is low but it also means that this type of capability is achievable and affordable by all kinds of small and medium businesses. Let’s just think of 5 years ago and the thought of Global Analytics being widely available would have been unheard of! In case you can’t tell already, I’m incredibly excited by how this technology has evolved.
Alright, with the main solution covered, let’s check out the demo where I will walk through how easy it is to perform Global Analytics with the power of these two Azure services. Cheers!