Querying Data with SQL in BigQuery

In this tutorial, we will explore how to query data using SQL in BigQuery. We’ll start with writing basic SQL queries, move on to advanced functions such as window functions and arrays, and conclude with query optimization techniques.

1. Writing Basic SQL Queries in BigQuery

Let's begin by understanding how to write basic SQL queries in BigQuery. BigQuery supports standard SQL, allowing you to perform queries to retrieve, manipulate, and analyze data.

Loading the Data

To use the dataset in BigQuery, you first need to load the CSV into BigQuery as a table. If you haven't already, follow these steps:

  1. Upload the CSV to a Google Cloud Storage bucket.
  2. Create a dataset in BigQuery.
  3. Load the data into a table by specifying the file format (CSV) and schema (columns like brand, model, price, rating, etc.).
Note: The examples in this tutorial are based on the laptop_prices.csv dataset provided.

Basic SQL Query Example

Once the data is loaded, you can perform a simple query to retrieve information. For example, if we want to select the laptop brands and their prices, we would write:

Alps

Output

Alps

This query retrieves the first 10 rows of Company and Price_euros from the table.

Filtering Data

To filter data, you can use the WHERE clause. For example, if you want to find laptops that cost more than 1000 units:

Alps

Output

Alps

This query returns all laptops where the price is greater than 1000 units.

Sorting Data

If you want to see the most expensive laptops, you can order the results by price in descending order:

Alps

Output

Alps

This query returns the top 5 most expensive laptops.

2. Advanced SQL Functions (e.g., Window Functions, Arrays)

BigQuery provides advanced SQL functions that allow for more complex data manipulation, such as window functions and array handling.

Window Functions

Window functions help perform calculations across sets of table rows that are related to the current row. For example, if you want to rank laptops based on their price, the RANK() function can be used:

Alps

Output

Alps

Here, the laptops are ranked according to their price, with the highest price getting rank 1.

Arrays in BigQuery

BigQuery supports arrays, which can be useful for creating lists or handling data collections. For example, to get an array of models for each laptop brand, use the ARRAY_AGG() function:

Alps

Output

Alps

This query groups all laptop models by brand and returns an array of models for each brand.

3. Query Optimization Techniques for BigQuery

BigQuery is designed to handle large-scale data efficiently, but it’s essential to write optimized queries to minimize cost and improve performance. Below are some optimization techniques you can use when querying data in BigQuery.

1. Limit the Data Scanned

One of the most important optimization techniques is limiting the amount of data scanned. Using SELECT only for the columns you need and adding WHERE clauses to filter rows early can reduce the query cost. For example, if you only need the compamy and price, there’s no need to query all columns:

Alps

Output

Alps

This query filters the data and selects only the brand and price columns, which reduces the amount of data scanned.

2. Use Caching for Repeated Queries

BigQuery caches the results of your queries for 24 hours. If you run the same query multiple times, caching can help avoid additional costs.

For example, if you frequently run this query to find high-priced laptops:

Alps

If you run this query again within 24 hours, BigQuery uses the cached results without scanning the table again.

3. Use Approximate Aggregations

For large datasets, using approximate functions like APPROX_COUNT_DISTINCT() can speed up queries. For example, to get an approximate count of distinct laptop brands, you can use:

Alps

Output

Alps

This is faster than using COUNT(DISTINCT) and provides nearly the same result.