Why it is sometimes useful to use web crawling to retrieve data

michaljanik · June 26, 2023, 11:53am

Scraping Camel acquires data by crawling web pages independently and continuously, extracting the information it needs from the site. This differs from approaches in which the website operator generates data feeds (XML or CSV), sends the data via an API, and so on. Both approaches have their advantages. In general, I would say that we prefer to connect via feeds or APIs, because data processing is faster and more efficient. However, scraping has its uses, for example in the following situations.

XML feeds do not exist

This is the case when the data provider has the necessary information on the web but cannot generate an XML feed from it. The situation is especially common in foreign countries. It also applies to websites without a shopping cart: wholesale catalogues of goods, tour offers, catalogues of financial products, cultural events; in short, websites with a multi-page catalogue.

Feeds do not cover some pages

In the case of online stores, this can include category pages, blog content pages, static contact pages, etc. These pages are usually not covered by XML feeds, unlike product pages.

Feeds exist but do not contain the necessary data

In practice, XML feeds often contain only basic information, while the site itself holds extra details, for example about sizes, detailed stock availability, parameters, etc. Scraping Camel can create a feed from the website, which the user can either use in Mergado to create an export or use to append the scraped data to an existing export via the data import rule.
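To make the idea of appending scraped data to an existing feed more concrete, here is a minimal Python sketch. It is not how Mergado or the data import rule works internally; it only illustrates matching records by a shared ID. The element names (SHOP, SHOPITEM, ITEM_ID, SIZES, STOCK), the sample data, and the function enrich_feed are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical basic feed: only IDs and names, no sizes or stock.
FEED_XML = """<SHOP>
  <SHOPITEM><ITEM_ID>A1</ITEM_ID><PRODUCTNAME>Trail shoe</PRODUCTNAME></SHOPITEM>
  <SHOPITEM><ITEM_ID>B2</ITEM_ID><PRODUCTNAME>Rain jacket</PRODUCTNAME></SHOPITEM>
</SHOP>"""

# Hypothetical scraped records, keyed by the same item ID.
SCRAPED = {
    "A1": {"SIZES": "41,42,43", "STOCK": "12"},
}


def enrich_feed(feed_xml: str, scraped: dict[str, dict[str, str]]) -> str:
    """Append scraped fields to every feed item that has a matching ID."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("SHOPITEM"):
        item_id = item.findtext("ITEM_ID")
        # Items without scraped data are left unchanged.
        for tag, value in scraped.get(item_id, {}).items():
            ET.SubElement(item, tag).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(enrich_feed(FEED_XML, SCRAPED))
```

The key design choice is that both sources share a stable identifier; without one, the scraped values cannot be reliably attached to the right product.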

Data not listed elsewhere can be extracted from the page content

Scraping Camel can parse the content of a web page and extract data from it that is not explicitly listed elsewhere. We will discuss this functionality in more detail at some point.
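As a rough illustration of what extracting data that is not explicitly listed can mean, the Python sketch below pulls a product weight out of free-form description text. This is not Scraping Camel's implementation; the sample page, the TextCollector class, and the extract_weight_kg function are hypothetical and use only the standard library.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a page and discards the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_weight_kg(html: str):
    """Return the first weight in kilograms mentioned in the text, or None."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Accept both "1.2 kg" and "1,2 kg" (decimal comma).
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*kg\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


# Hypothetical page: the weight appears only inside the description text.
PAGE = (
    "<html><body><h1>Camping stove</h1>"
    "<p>Compact burner, weighs only <b>1,2 kg</b> with case.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(extract_weight_kg(PAGE))
```

A real page would need more careful handling (for example, skipping script and style content and dealing with several candidate values), so this should be read only as a sketch of the principle.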