Here we’ll cover the steps for loading XML data into Snowflake, and querying said data with ANSI-standard SQL. This will facilitate 100% RDB capabilities for XML, even though its data format constitutes a completely non-relational, document store.
While JSON has superseded XML as the preferred format for featherweight and uncomplicated data exchange, XML maintains a solidly fixed place in corporate IT. Ranging from a wide array, such as desktop office applications to SOA-based enterprise applications, to a host of APIs and industry-level data exchange protocols. With respect to the latter, these include OTA for the travel industry, FpML for financial services, and HL7 for healthcare, so it’s apparent XML is not just here to stay but will continue to grow.
Since XML is so ubiquitous, it’s important to utilize a data warehouse that makes it effortless to query XML and yield insights.
Up until now, alternative approaches have necessitated the transformation of the XML prior to loading it into a SQL database, usually complicating, if not delaying, the entire process since it requires a separate tool to transform the data. Alternatively, one would need to write a custom transformation script for oneself. Using Snowflake, the time has finally come where one can ingest XML data straight into the warehouse – no transformation required. This is thanks to Snowflake’s patented VARIANT data type, which allows XML data to either be inserted into an existing table within Snowflake, or stored as its own columnar database table.
Here’s how it’s done:
Rounding-up the sample XML data
For starters, let’s assemble the XML data itself. As an example, we’ll make use of publicly available US Treasury auction data, which can be found here. That site makes available all Treasury auction disclosures for notes, T-bills, bonds, etc., covering the entire previous decade, formatted as XML. As a whole, one can find approximately 25,000 XML files there. We’ll be using five for our examples herein.
Figure 1 provides a read-out of one of the XML-based, Treasury auction data files:
Loading the XML data into Snowflake
Now we'll load the data into Snowflake, which provides an intuitive UI, making it easy to load and prepare the data to facilitate queries for analysis. Here's an overview of the steps:
• Create a new table with an XML column using Snowflake’s patented VARIANT data type
• Create a new column-oriented file format for XML (“columnarizing” the XML), to be used with the new table
• Load the XML sample data into the XML column of the new table
The actual sequence of these steps is as follows:
Create a new table
• Create a table with XML column with VARIANT data type
• From the Snowflake UI, select the database into which the XML data will be loaded (e.g., ‘TUTORIAL_DB’ for our example here). [To learn how to create a new database, click here].
• From the database tool bar, click on Tables->Create
• Enter the table name as ‘treasury_auction_xml’. The default schema name is ‘Public’. Specify the schema (column and data type as shown):
If you prefer to work with SQL statements for a script or a workbook, you can click on “Show SQL” below the Create Table dialog box to display the equivalent SQL statements that can be cut and pasted into your script. Snowflake provides this option for most selection boxes. Doing so for this example will reveal:
Create a new file format for XML
• As part of the data loading process, columnarize the XML via a new file format:
Select the database ‘TUTORIAL_DB’, click on the table ‘treasury_auction_xml’.
• Click on ‘Load Table’, a series of pop-up windows will be displayed to guide the user through the data loading process. Select the warehouse you designate for this exercise. Next, specify the location of the XML files to be loaded. The dialog box for our example is shown below:
• Next, in the file format step, click on ‘+’ and a pop-up window is displayed. Specify the various options as shown below:
Complete the loading process
Select ‘Load Options’. This is the last step to complete the loading of XML data into the new table with an XML column. A few different error handling options are presented as shown. Select the choice that is appropriate for your requirement.
• For this example, we choose to skip an XML file if an error is detected in parsing. Click on ‘Load’ to finish.
• Load result is displayed in the example below, showing that successful loading a number of XML files into the new table treasury_auction_xml.
Querying the XML data
With the XML data loaded, you can easily query the data in a fully relational manner, expressing queries with robust ANSI SQL. For example, for a quick glance of all of the XML documents loaded into the table, execute this query: SELECT src_xml FROM treasure_auction_xml.
This query will return all of the XML documents in the XML column. The following is a portion of the output:
This is just a quick example of how you can easily query XML with Snowflake with standard SQL. In addition, you have an assortment of XML functions to query XML elements in the XML files. This includes a powerful FLATTEN LATERAL table-value function, which is an un-nesting function that will allow you to access the inherent hierarchical structures within an XML file.
We’ll cover more on querying XML with the second part of this blog post.