Querying Data with SQL in BigQuery
In this tutorial, we will explore how to query data using SQL in BigQuery. We’ll start with writing basic SQL queries, move on to advanced functions such as window functions and arrays, and conclude with query optimization techniques.
Let's begin by understanding how to write basic SQL queries in BigQuery. BigQuery supports standard SQL, allowing you to perform queries to retrieve, manipulate, and analyze data.
Loading the Data
To use the dataset in BigQuery, you first need to load the CSV into BigQuery as a table. If you haven't already, follow these steps:
Basic SQL Query Example
Once the data is loaded, you can perform a simple query to retrieve information. For example, if we want to select the laptop brands and their prices, we would write:
This query retrieves the first 10 rows of Company
and Price_euros
from the table.
Filtering Data
To filter data, you can use the WHERE clause. For example, if you want to find laptops that cost more than 1000 units:
This query returns all laptops where the price is greater than 1000 units.
Sorting Data
If you want to see the most expensive laptops, you can order the results by price in descending order:
This query returns the top 5 most expensive laptops.
BigQuery provides advanced SQL functions that allow for more complex data manipulation, such as window functions and array handling.
Window Functions
Window functions help perform calculations across sets of table rows that are related to the current row. For example, if you want to rank laptops based on their price, the RANK() function can be used:
Here, the laptops are ranked according to their price, with the highest price getting rank 1.
Arrays in BigQuery
BigQuery supports arrays, which can be useful for creating lists or handling data collections. For example, to get an array of models for each laptop brand, use the ARRAY_AGG() function:
This query groups all laptop models by brand and returns an array of models for each brand.
BigQuery is designed to handle large-scale data efficiently, but it’s essential to write optimized queries to minimize cost and improve performance. Below are some optimization techniques you can use when querying data in BigQuery.
1. Limit the Data Scanned
One of the most important optimization techniques is limiting the amount of data scanned. Using SELECT only for the columns you need and adding WHERE clauses to filter rows early can reduce the query cost. For example, if you only need the compamy and price, there’s no need to query all columns:
Output
This query filters the data and selects only the brand and price columns, which reduces the amount of data scanned.
2. Use Caching for Repeated Queries
BigQuery caches the results of your queries for 24 hours. If you run the same query multiple times, caching can help avoid additional costs.
For example, if you frequently run this query to find high-priced laptops:
If you run this query again within 24 hours, BigQuery uses the cached results without scanning the table again.
3. Use Approximate Aggregations
For large datasets, using approximate functions like APPROX_COUNT_DISTINCT() can speed up queries. For example, to get an approximate count of distinct laptop brands, you can use:
Output
This is faster than using COUNT(DISTINCT)
and provides nearly the same result.