Build an End-to-End Data Capture Pipeline using Document AI


GSP927

Overview

The Document AI API is a document understanding solution that takes unstructured data, such as documents and emails, and makes the data easier to understand, analyze, and consume.

In this lab, you'll build a document processing pipeline to automatically analyze documents uploaded to Cloud Storage. The pipeline uses a Cloud Function with a Document AI form processor to extract data and store it in BigQuery. If the form includes address fields, the address data is sent to a Pub/Sub topic. This triggers a second Cloud Function, which uses the Geocoding API to add coordinates and writes the results to BigQuery.

This simple pipeline uses a general form processor to detect basic form data, like labeled address fields. For more complex documents, Document AI offers specialized parsers (beyond the scope of this lab) that extract detailed information even without explicit labels. For instance, the Invoice parser can identify address and supplier details from an unlabeled invoice by understanding common invoice layouts.

The overall architecture that you will create looks like the following:

  1. Forms with address data are uploaded to Cloud Storage.
  2. The upload triggers a Cloud Function that processes the forms.
  3. The Cloud Function calls Document AI to parse each form.
  4. The Document AI JSON output is saved back to Cloud Storage.
  5. The Cloud Function writes the extracted form data to BigQuery.
  6. The Cloud Function publishes any addresses to a Pub/Sub topic.
  7. The Pub/Sub message triggers a second Cloud Function for geocode processing.
  8. The second Cloud Function calls the Geocoding API.
  9. The geocoding results are written to BigQuery.

This example architecture uses Cloud Functions to implement a simple pipeline, but Cloud Functions are not recommended for production environments as the Document AI API calls can exceed the timeouts supported by Cloud Functions. Cloud Tasks are recommended for a more robust serverless solution.

Objectives

In this lab, you learn how to:

  - Enable the APIs required by the pipeline and create a restricted Geocoding API key.
  - Create a Document AI form processor.
  - Create the Cloud Storage buckets, BigQuery dataset, and Pub/Sub topic used by the pipeline.
  - Deploy Cloud Functions that call the Document AI and Geocoding APIs and write the results to BigQuery.

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  - Access to a standard internet browser (Chrome browser recommended).
  - Time to complete the lab. Remember, once you start, you cannot pause a lab.

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

  1. Click Activate Cloud Shell at the top of the Google Cloud console.

When you are connected, you are already authenticated, and the project is set to your Project ID. The output contains a line that declares the Project ID for this session:

Your Cloud Platform project in this session is set to <PROJECT_ID>

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  2. (Optional) You can list the active account name with this command:
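```
gcloud auth list
```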
  3. Click Authorize.

Output:
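```
ACTIVE: *
ACCOUNT: <your lab account>
```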

  4. (Optional) You can list the project ID with this command:
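```
gcloud config list project
```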

Output:

```
[core]
project = <PROJECT_ID>
```

Note: For full documentation of gcloud, in Google Cloud, refer to the gcloud CLI overview guide.

Task 1. Enable the APIs required for the lab

You must enable the APIs for Document AI, Cloud Functions, Cloud Build, and Geocoding for this lab, then create the API key that is required by the Geocoding Cloud Function.

  1. In Cloud Shell, enter the following commands to enable the APIs required by the lab:

```
gcloud services enable documentai.googleapis.com
gcloud services enable cloudfunctions.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable geocoding-backend.googleapis.com
```
  2. In the console, in the Navigation menu, click APIs & services > Credentials.
  3. Select Create credentials, then select API key from the dropdown menu.

The API key created dialog box displays your newly created key. An API key is a long string containing upper and lower case letters, numbers, and dashes, for example: a4db08b757294ea94c08f2df493465a1.

  4. Click Edit API key in the dialog box.
  5. Select Restrict key in the API restrictions section to add API restrictions for your new API key.
  6. Click in the filter box and type Geocoding API.
  7. Select Geocoding API and click OK.
  8. Click the Save button.

Task 2. Copy the lab source files into your Cloud Shell

In this task, you copy the source files into your Cloud Shell. These files include the source code for the Cloud Functions and the schemas for the BigQuery tables that you will create in the lab.
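The lab provides the exact source location for these files. As a rough sketch, with a hypothetical Cloud Storage path standing in for the real one, the copy might look like this:

```
# Copy the lab source files into the Cloud Shell home directory.
# The gs:// path is a placeholder; use the path given in the lab.
gsutil -m cp -r gs://<lab-source-bucket>/* ~/
```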

Task 3. Create a form processor

In this task, you create an instance of the generic form processor using the Document AI Form Parser. The generic form processor will process any type of document and extract all of the text content it can identify in the document. It is not limited to printed text: it can handle handwritten text and text in any orientation, it supports a number of languages, and it understands how form data elements relate to each other, so you can extract key:value pairs for form fields that have text labels.

  1. In the console, open the Navigation menu and select Document AI > Overview.
  2. Click Explore Processors, then click Create Processor on the Form Parser card.
  3. Specify the processor name as form-processor and select the region US (United States) from the list.
  4. Click Create to create your processor.

You will configure a Cloud Function later in this lab with the processor ID and location of this processor so that the Cloud Function will use this specific processor to process sample invoices.
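As a preview of what that Cloud Function does, here is a minimal sketch of calling a form processor with the Document AI Python client library. The project ID, processor ID, and file name are placeholders, not values from the lab:

```python
from google.cloud import documentai_v1 as documentai

# Placeholder values: substitute your own project, processor ID, and file.
PROJECT_ID = "your-project-id"
LOCATION = "us"            # the region selected when creating the processor
PROCESSOR_ID = "your-processor-id"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

with open("invoice.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)
document = result.document

def anchor_text(layout, text):
    # Resolve a layout's text anchor back into the document's full text.
    return "".join(
        text[int(seg.start_index):int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )

# Print the key:value pairs the form parser detected on each page.
for page in document.pages:
    for field in page.form_fields:
        key = anchor_text(field.field_name, document.text).strip()
        value = anchor_text(field.field_value, document.text).strip()
        print(f"{key}: {value}")
```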

Task 4. Create Cloud Storage buckets and a BigQuery dataset

Prepare your environment by creating the Google Cloud resources that are required for your document processing pipeline.

Create input, output, and archive Cloud Storage buckets

Create input, output, and archive Cloud Storage buckets for your document processing pipeline.
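The lab specifies the exact bucket names and locations; a sketch with hypothetical bucket names might look like this:

```
# Hypothetical bucket names; substitute the names given in the lab.
export PROJECT_ID=$(gcloud config get-value project)
gsutil mb -l us gs://${PROJECT_ID}-input-invoices
gsutil mb -l us gs://${PROJECT_ID}-output-invoices
gsutil mb -l us gs://${PROJECT_ID}-archived-invoices
```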

Create a BigQuery dataset and tables

Create a BigQuery dataset and the three output tables required for your data processing pipeline.
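The dataset is named invoice_parser_results (it is referenced below); the table name and schema path in this sketch are placeholders for the ones shipped with the lab source files:

```
# Create the dataset, then one table per output, each from a JSON schema file.
bq mk --dataset invoice_parser_results
# Repeat for each of the three tables; names and schemas come from the lab files.
bq mk --table invoice_parser_results.<table_name> <path/to/schema.json>
```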

You can navigate to BigQuery in the Cloud Console and inspect the schemas for the tables in the invoice_parser_results dataset using the BigQuery SQL workspace.

Create a Pub/Sub topic

Initialize the Pub/Sub topic used to trigger the Geocoding API data enrichment operations in the processing pipeline.
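Creating the topic is a single command; the topic name here is a hypothetical placeholder for the one the lab specifies:

```
gcloud pubsub topics create <geocode-topic-name>
```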

Task 5. Create Cloud Functions

Create the two Cloud Functions that your data processing pipeline uses to process invoices uploaded to Cloud Storage. These functions use the Document AI API to extract form data from the raw documents, then use the Geocoding API to retrieve geolocation data for the address information extracted from the documents.
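For reference, here is a minimal sketch of the kind of request the geocoding function makes against the standard Maps Geocoding REST endpoint; the address and API key are placeholders:

```python
import requests

API_KEY = "<your-restricted-api-key>"  # the key created in Task 1

# Forward-geocode a sample address to latitude/longitude coordinates.
resp = requests.get(
    "https://maps.googleapis.com/maps/api/geocode/json",
    params={"address": "1600 Amphitheatre Parkway, Mountain View, CA", "key": API_KEY},
)
results = resp.json().get("results", [])
if results:
    location = results[0]["geometry"]["location"]
    print(location["lat"], location["lng"])
```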

You can examine the source code for the two Cloud Functions using the Code Editor or any other editor of your choice. The Cloud Functions are stored in the following folders in Cloud Shell:

  - scripts/cloud-functions/process-invoices
  - scripts/cloud-functions/geocode-addresses

The main Cloud Function, process-invoices , is triggered when files are uploaded to the input files storage bucket you created earlier.
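A deployment sketch for this function (first-generation Cloud Functions, triggered on object finalize); the region, runtime, entry point, and bucket name are hypothetical and should match the lab's actual values:

```
gcloud functions deploy process-invoices \
  --region=us-central1 \
  --runtime=python39 \
  --source=scripts/cloud-functions/process-invoices \
  --entry-point=process_invoice \
  --trigger-resource=gs://${PROJECT_ID}-input-invoices \
  --trigger-event=google.storage.object.finalize
```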

The function folder scripts/cloud-functions/process-invoices contains the two files that are used to create the process-invoices Cloud Function.

The requirements.txt file specifies the Python libraries required by the function. This includes the Document AI client library as well as the other Google Cloud libraries required by the Python code to read the files from Cloud Storage, save data to BigQuery, and write messages to Pub/Sub that will trigger the remaining functions in the solution pipeline.

The main.py Python file contains the Cloud Function code that creates the Document AI, BigQuery, and Pub/Sub API clients, along with the following internal functions to process the documents: