According to a recent report published by IDC, the global data volume is expected to increase from 50 ZB to 175 ZB by 2025, and 30% of that data will be generated in real time. It is high time for analysts to develop and adopt solutions that help them ingest, analyze, and learn from that data.
Cloud Data Warehousing
Cloud data warehousing is widely regarded as the answer to the sheer amount of data being generated today. Compared with on-premises data warehousing, it offers advantages that are hard to match: scalability, cost effectiveness, higher uptime, and availability, to name a few. It does have some disadvantages relative to the traditional on-premises data warehouses in use today, but its advantages outweigh them by a wide margin.
The picture below depicts the differences between the two solutions and shows why cloud data warehousing is the solution that developers and analysts are looking for.
Having distinguished between the two data warehouse technologies, how does Snowflake set itself apart from other cloud data warehouse platforms?
This article describes the revolutionary architecture of Snowflake and shows how it stands out when compared with other data warehousing solutions, both cloud and on-premises.
Snowflake’s architecture is composed of three layers: Centralized Storage, Multi-Cluster Compute, and Cloud Services. Each layer contributes to making Snowflake the efficient, performant, and scalable solution that it is.
Let us go through them one by one:
The first component of Snowflake’s architecture is Centralized Storage, or Data Storage, as people call it. It provides long-term storage of data, i.e., the Remote Disk, which is built on top of Amazon S3, Azure Blob Storage, or Google Cloud Storage.
Here, Snowflake stores all the data that data professionals are going to analyze. The data is organized into databases, schemas, tables, and views.
The second layer is Multi-Cluster Compute, also known as the Query Processing Layer. This is where queries are executed with the use of (virtual) warehouses. In contrast to a traditional data warehouse, which consists of multiple databases, a Snowflake warehouse is a set of compute resources: it uses CPU, memory, and cache to perform all kinds of data manipulation, such as inserting, deleting, updating, and selecting data.
These warehouses come in different sizes, like the sizes of a hoodie. Scaling up the warehouse can be beneficial when the complexity of the queries increases. Cost and performance both rise with the hoodie size, so keep that in mind and weigh the return on investment (ROI) before resizing.
Warehouses can also differ in their type of clustering. By default, a warehouse consists of a single cluster of servers. A multi-cluster warehouse increases the overall costs, but it also makes multiple clusters of the predefined warehouse size available to handle the workload.
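The trade-off between warehouse size, cluster count, and cost can be sketched in a few lines of Python. This is only a rough illustration: the credits-per-hour table and the function name `estimate_credits` are assumptions made for this example, so check the current pricing before relying on any number.

```python
# Illustrative credits per hour for each "hoodie size".
# ASSUMPTION: these values are placeholders for the sketch, not a price list.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def estimate_credits(size: str, hours: float, clusters: int = 1) -> float:
    """Estimate credits used by `clusters` clusters of one size running for `hours`."""
    if size not in CREDITS_PER_HOUR:
        raise ValueError(f"unknown warehouse size: {size}")
    if clusters < 1:
        raise ValueError("a warehouse needs at least one cluster")
    # Every running cluster is billed at the rate of the chosen size.
    return CREDITS_PER_HOUR[size] * hours * clusters


single = estimate_credits("S", hours=3)              # one cluster
multi = estimate_credits("S", hours=3, clusters=2)   # two clusters, same size
print(single, multi)
```

Under these assumptions, doubling the number of clusters doubles the estimate, which is the kind of figure to hold against the expected gain in performance or concurrency.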
The cache in this layer is also referred to as the Raw Data Cache or Warehouse Cache. It contains the raw data that was queried recently, and it vanishes once the warehouse is suspended. (“What are the different Snowflake components?”)
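As a toy model of this behavior, the sketch below puts a local cache in front of remote storage and discards it when the compute resources are released. The class `ToyWarehouse` and its methods are hypothetical names for illustration, not part of any Snowflake API.

```python
class ToyWarehouse:
    """Hypothetical warehouse with a local cache in front of remote storage."""

    def __init__(self, remote_storage: dict):
        self.remote_storage = remote_storage  # stands in for the storage layer
        self.local_cache = {}                 # raw data that was read recently
        self.remote_reads = 0                 # how often remote storage was hit

    def read(self, key):
        # Only go to remote storage when the data is not cached locally.
        if key not in self.local_cache:
            self.remote_reads += 1
            self.local_cache[key] = self.remote_storage[key]
        return self.local_cache[key]

    def shut_down(self):
        """Releasing the compute resources discards the local cache."""
        self.local_cache.clear()
```

Reading the same key twice touches remote storage once; after `shut_down()`, the next read has to fetch the data from remote storage again.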
The last layer is the Cloud Services Layer, commonly called the Services Layer. It consists of several sub-services, namely:
- Authentication & Security
- Access Control
- Metadata Management
- Data Sharing
- Query Compilation & Optimization
- Infrastructure & User Management
Cloud Services thus provides the front end through which users interact with the distinct services listed above. It also provides an available, distributed metadata store that does not need user compute resources, which means that gathering statistics such as the table size, the current schema you are in, or recent query results does not require a running warehouse.
Certain DDL (Data Definition Language) commands, such as CREATE, ALTER, or DROP DATABASE, do not require a running warehouse, for the simple reason that they change the metadata of the database rather than the data itself.
Apart from that, certain queries that use aggregate functions such as COUNT, MIN, or MAX can often be answered from this metadata as well, so they draw on the computational power of the services layer rather than that of a warehouse.
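To see why a metadata store can serve such statistics without touching the data, consider the sketch below. It assumes that per-partition statistics (row count, minimum, maximum) are already recorded; `PartitionStats` and the helper functions are hypothetical names, and whether a real query is served from metadata depends on the query itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical statistics kept per storage partition in the metadata store."""
    row_count: int
    min_value: int
    max_value: int


def count_rows(stats: Iterable[PartitionStats]) -> int:
    # Total row count is the sum of the per-partition counts.
    return sum(s.row_count for s in stats)


def table_min(stats: Iterable[PartitionStats]) -> int:
    # The table minimum is the smallest per-partition minimum.
    return min(s.min_value for s in stats)


def table_max(stats: Iterable[PartitionStats]) -> int:
    # The table maximum is the largest per-partition maximum.
    return max(s.max_value for s in stats)


metadata = [PartitionStats(100, 3, 40), PartitionStats(250, 1, 25)]
print(count_rows(metadata), table_min(metadata), table_max(metadata))
```

None of the three functions reads a single row of table data; they only combine numbers that were recorded when the data was written.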
Query compilation and optimization, infrastructure and user management, and authentication and security also happen in this all-important layer of Snowflake. (“What are the different Snowflake components?”) Explaining these in detail would lead us too far, but it is important to know that they are strictly separated from the other layers within Snowflake.
Sharing data within the organization and with other relevant stakeholders has long been a struggle for database administrators. With traditional solutions, companies struggle to share their data, especially outside the organization, which often results in scattered, inconsistent copies of data in various places, some internal and some external. In short, this is the limitation of shared-disk and shared-nothing environments.
Snowflake overcomes this limitation in a highly efficient way through its data sharing functionality. Your data stays in centralized storage, and only references to this data are shared. Put simply, this works just like referencing a cell in one Excel worksheet from a cell in another worksheet.
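The idea of sharing references instead of copies can be illustrated with a small Python sketch. The classes `CentralStorage` and `Share` are hypothetical and only model the concept; they do not mirror how Snowflake implements sharing internally.

```python
class CentralStorage:
    """Hypothetical stand-in for the centralized storage layer."""

    def __init__(self):
        self._tables = {}

    def write(self, name, rows):
        self._tables[name] = rows

    def read(self, name):
        return self._tables[name]


class Share:
    """A share holds table names (references), never the rows themselves."""

    def __init__(self, storage: CentralStorage, table_names):
        self._storage = storage
        self._table_names = set(table_names)

    def query(self, name):
        if name not in self._table_names:
            raise PermissionError(f"{name} is not part of this share")
        # Always resolved against the single stored copy.
        return self._storage.read(name)


storage = CentralStorage()
storage.write("sales", [("2024-01", 100)])
share = Share(storage, ["sales"])
storage.write("sales", [("2024-01", 100), ("2024-02", 120)])
print(share.query("sales"))  # the consumer sees the provider's latest data
```

Because the consumer only ever holds a reference, there is no second copy that could drift out of sync, which is the same property the Excel cell reference has.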
Caching in the cloud services layer is also critical to Snowflake’s performance. This layer manages the metadata cache and the results cache: (“What are the different Snowflake components?”)
- Metadata: It stores information about two things: tables and micro-partitions. For tables, the number of rows, the table size in bytes, and file references are kept. For micro-partitions, which are the units of storage, the count of NULL values, the number of distinct values, and the min/max of all values are cached.
- The Results Cache: As the name implies, it caches the results of recently run queries. This means that if you calculated the total profit by month for this year, and you (or someone else with the same user role) run the exact same query within 24 hours of the initial query, Snowflake does not have to spin up or use a warehouse to retrieve this result.
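The behavior of the results cache can be approximated with a small time-based cache keyed by role and query text. This is a simplified sketch under stated assumptions: `ToyResultCache` and `run_on_warehouse` are hypothetical names, and the model ignores details such as invalidation when the underlying data changes.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # the 24-hour window described above


class ToyResultCache:
    """Hypothetical results cache keyed by (role, exact query text)."""

    def __init__(self, run_on_warehouse, clock=time.time):
        self._run = run_on_warehouse   # callable that "uses a warehouse"
        self._clock = clock            # injectable so the TTL can be tested
        self._entries = {}             # (role, query_text) -> (stored_at, result)
        self.warehouse_runs = 0

    def execute(self, role: str, query_text: str):
        key = (role, query_text)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= TTL_SECONDS:
            return entry[1]            # served from the cache, no warehouse
        # Cache miss or expired entry: run the query and remember the result.
        self.warehouse_runs += 1
        result = self._run(query_text)
        self._entries[key] = (now, result)
        return result
```

Running the same query text twice with the same role inside the window increments `warehouse_runs` only once; a different role, a different query text, or an expired entry triggers a new run.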