What Is AI-Based Web Mining Technology and What Advantages Does It Have?

Web mining is the application of data mining techniques to discover patterns from the World Wide Web. It uses automated methods to extract both structured and unstructured data from web pages, server logs and link structures. Our AI-based web mining technology unifies web protocol analysis, data scraping, artificial intelligence, data processing and related techniques. On top of this technology, we have built a distributed system to work with large volumes of information on the public web.

Today, however, as the volume of data on the internet grows explosively, the bar for collecting large volumes of public data is much higher than ever before. Thanks to our AI-based web mining technology, we are able to collect large volumes of public data without triggering anti-bot detection.

Non-professionals can easily run into all kinds of roadblocks, such as IP blocking, Google CAPTCHAs, hCaptcha challenges and so on. Most of the time, these roadblocks are quite difficult to deal with. A common trap is that writing some Python scraping code appears easy, but the code stops working very quickly, or it does not work for some websites at all. This is because the anti-bot mechanism cannot easily be seen in the UI of the web page. A modern website can use a very large number of techniques to detect a bot and block your scraper. In many cases, those blocking mechanisms are also far more complicated than they appear in the front-end UI. The Cloudflare 5-second check is a good example.
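Before turning to the Cloudflare example, here is a minimal Python sketch of the trap described above: a scraper that only looks at whether a request "succeeded" will not notice that it has been served a block or challenge page. The status codes and page markers below are our own illustrative assumptions, not an exhaustive or authoritative list, and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass

# Status codes that commonly accompany throttling or blocking.
# This set is an assumption for illustration, not a complete list.
BLOCK_STATUS_CODES = {403, 429, 503}

# Substrings that often show up on challenge pages (illustrative guesses).
# "hcaptcha" is listed before "captcha" so the more specific marker wins.
CHALLENGE_MARKERS = ("hcaptcha", "captcha", "checking your browser")


@dataclass
class Verdict:
    blocked: bool
    reason: str


def classify_response(status: int, body: str) -> Verdict:
    """Guess whether a response is a block/challenge page rather than real content."""
    if status in BLOCK_STATUS_CODES:
        return Verdict(True, f"status {status}")
    lowered = body.lower()
    for marker in CHALLENGE_MARKERS:
        if marker in lowered:
            return Verdict(True, f"challenge marker: {marker}")
    return Verdict(False, "ok")


def fetch_and_classify(url: str, timeout: float = 10.0) -> Verdict:
    """Fetch a URL with the standard library and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return classify_response(resp.status, body)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the error still carries a body.
        body = err.read().decode("utf-8", errors="replace")
        return classify_response(err.code, body)
```

Note that a challenge page can come back with status 200, which is why the sketch inspects the body as well as the status code. A check like this only tells you that you were blocked; it does nothing to explain why.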

Cloudflare, Inc. is an American web infrastructure and website security company that provides content delivery network and DDoS mitigation services. Its services occur between a website's visitor and the Cloudflare customer's hosting provider, acting as a reverse proxy for websites. Its headquarters are in San Francisco.

The example below is only a reference to give you a basic idea of how Cloudflare works and what mechanisms it uses to block robots. Once you have a basic understanding, you will realize how tough these roadblocks are to deal with.

In this example we use the Cloudflare 5-second check as an illustration. When you browse some sites, you might have seen this:

If you think this is just a simple browser check, you are probably wrong. Let's take a deeper look at what is behind this 5-second check. If you use Chrome developer tools to intercept the traffic, you will see the website sending some data to Cloudflare during the check. When the 5-second check finishes, the Network tab in Chrome DevTools shows that Cloudflare returns a token.

Let's dig further into that token request. We need to pull all the JavaScript from the site, and after a lot of debugging and searching we arrive at a piece of obfuscated JS code that encrypts a set of variables, such as your browser version, your mouse movements, your operating system version, your browser's cookie parameters, your device fingerprint and so on.
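To make the idea concrete, here is a toy Python sketch of how a challenge script might bundle client signals into one opaque payload that a server can check. This is NOT Cloudflare's algorithm: the field names are hypothetical, and we use plain JSON, a SHA-256 digest and Base64 in place of the real, obfuscated encryption, purely to show the shape of the problem.

```python
import base64
import hashlib
import json


def build_payload(signals: dict) -> str:
    """Serialize client signals, attach a digest, and encode the result.

    Toy stand-in for a challenge script: real systems use encryption and
    many more signals; here a hash only detects naive tampering.
    """
    # Canonical form (sorted keys, no extra whitespace) so that the same
    # signals always produce the same digest.
    canonical = json.dumps(signals, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    envelope = {"d": canonical, "h": digest}
    raw = json.dumps(envelope).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def verify_payload(payload: str) -> bool:
    """Server-side check: recompute the digest and compare it."""
    envelope = json.loads(base64.urlsafe_b64decode(payload))
    expected = hashlib.sha256(envelope["d"].encode("utf-8")).hexdigest()
    return expected == envelope["h"]


# Hypothetical signals of the kind listed above.
example_signals = {
    "browser_version": "120.0",
    "os_version": "Windows 10",
    "mouse_moves": [[10, 12], [14, 18]],
    "cookies_enabled": True,
}
example_payload = build_payload(example_signals)
```

Even in this simplified form, a bot that sends no payload, or one whose contents do not match the digest, fails the check. The real mechanism adds obfuscation and encryption on top, which is what makes it so hard to follow.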

The encryption process is quite involved; see the screenshot below:

Also, the core JS algorithm uses heavily obfuscated variable names.

If you dig into the JS code, you will find many intermediate variables with strange names, for example:

We will not go any further here, since the purpose is not to show exactly how everything works but to illustrate that dealing with Cloudflare's bot detection isn't as easy as it appears from the UI. Finding out exactly how it works can turn into a big black hole.

So here comes the question: will a regular bot be able to bypass this? Unfortunately, the answer is no. A regular bot is easily blocked by these roadblocks. This is especially true for those who are not professional bot engineers, because the bots they write lack the intelligence to get past them. When building our AI-based web mining system, we spent significant effort tweaking and testing it so that it can get past most obstacles of this kind during the data gathering process.
