Using GPT-4 Vision for Multimodal Web Scraping
Adrian Krebs |
OpenAI recently released a multimodal version of GPT-4, called GPT-4 Vision (GPT-4V). GPT-4V can understand images as input and answer questions based on it.
80% of the world's data is unstructured and scattered across formats like websites, PDFs, or images that are hard to access and analyze. We believe that this new era of multimodal models will have a big impact on the web scraping and document processing space because it's now possible to understand unstructured data without having to rely on complex OCR technologies or tooling.
Let's explore potential applications of GPT-4V for web scraping.
Our first experiment with GPT-4V was to transform a screenshot of an Amazon product page into structured JSON data - a classic web scraping task that would require us to extract each field based on its CSS selector.
GPT-4V successfully turned the screenshot into structured JSON data.
Our next experiment was around extracting data from charts. I took a chart from the Stack Overflow developer survey 2023 website and tried to structure the data.
As we can see, GPT-4V was able to accurately transform a chart into JSON data. It worked well in this case but struggled with more complex test cases like stock charts, where the data points are more dense and probably harder to recognize and distinguish for GPT.
Complex table data is hard and work-intensive to extract with traditional web scraping methods, so we tested how GPT-4V performs on extracting table data.
Not surprisingly, it did transform all the product specs from a bike into JSON format. One challenge we already see here is that the table is scrollable and we can only process data that is visible on the screenshot and not the full HTML element. More on that in the section about limitations.
A common challenge when extracting data from websites is getting blocked by anti-bot mechanisms such as captchas, so we tried to use GPT-4V to solve captchas.
We discovered that GPT-4V could recognize a CAPTCHA within an image, but it frequently didn't pass the tests. For example, in our test, it successfully detected two buses in the captcha image but missed one.
Robotic Process Automation (RPA)
We then tried to get the coordinates of every button on a website screenshot, which would be very useful for click automation or RPA tasks. Finding links, buttons, and other elements just on the textual representation is often not that easy. And when it comes to desktop applications, the only viable solution for RPA is often traditional OCR techniques.
Although GPT-4V responded with the X and Y coordinates of each button on the screenshot, the coordinates are actually wrong and have a big offset for the X coordinates. This might be due to wrong approximations of the coordinates based on visual estimation. I didn't optimize the prompt further, but it looks like GPT-4V already performs quite well in identifying clickable actions on a screenshot, which would potentially have big implications on what workflows we can automate with RPA tools.
Instead of passing the plain image, we can identify and annotate objects in the image and then send the updated image to GPT-4V. This basically helps GPT-4V to better detect available actions on a screenshot.
Vimium is a Chrome extension that lets you navigate the web with only your keyboard, and we used it to tag all available buttons and links on the screenshot. GPT-4V was then able to figure out where to click based on the displayed keyboard shortcuts.
We believe that image processing, such as adding bounding boxes or object annotations, will make GPT-4V very powerful for RPA.
OpenAI ran tests on an early version of GPT-4V and documented the finding in the official GPT-4V(ision) System Card, revealing several limitations. These include missing text, characters, or mathematical symbols in images, and challenges in recognizing spatial locations and colors.
After running many experiments with website data, we discovered the following limitations for web scraping with GPT-4V:
- Limited context: Processing screenshots is limited to everything visible on the screen. We tried to process full-page screenshots, but this would require slicing into multiple smaller images to fit the output into the GPT context window. Basic RPA capabilities like scrolling and navigating are required in combination with the vision capabilities.
- Limited scalability: Using computer vision to extract data from websites works well on a small scale. Doing that for millions of web pages every day would be very inefficient and costly. As of today, GPT-4V has strict rate limits and is quite slow and expensive.
Our tests with GPT-4V showed promise. It successfully turned screenshots of complex websites into structured data, making a classic web scraping task much easier. It was also able to understand charts and transform the content into JSON, which is something no traditional web scraping method can do.
Although the OCR capabilities of GPT-4V are impressive, it is sometimes still misinterpreted or hallucinated.
In short, GPT-4V could opens new doors for web scraping, document processing, and RPA applications, although it's not yet ready for large-scale operations yet.
Combining GPT-4V's image understanding with GPT's semantic text understanding will allow us to better handle unstructured data, making task like web scraping more powerful and accurate. We think that multimodal LLMs will be complementary to existing data extraction solutions, only being used for data that is too complex to process with just the textual representation, such as charts, tables, or images.
Stay tuned—Kadoa will bring these capabilities to you soon.