Some sites block Chrome in headless mode, and we’ll look at how to get around this block.


Diagnostics is key in every area of computing and programming. This article starts by showing how to diagnose this blocking problem yourself. If that does not interest you, skip straight to the “Solution” section at the end of the article.


If you run into problems with headless mode, do not forget to take a screenshot with page.screenshot() to see what is going on. At the very least, this will tell you whether you are dealing with the same visible content that is displayed in “normal” (headful) browser mode, and whether you are stuck because of a broken script or something else entirely.
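As a minimal sketch of that advice, the helper below saves a full-page screenshot and also records the page title and URL, so the headless result can be compared with what a normal browser shows. The function name captureDiagnostics and the default file name are hypothetical; page is assumed to be a Puppeteer Page object.

```javascript
// Hypothetical helper: snapshot what headless Chrome actually rendered.
// `page` is assumed to be a Puppeteer Page object.
async function captureDiagnostics(page, path = 'headless-debug.png') {
  // fullPage captures the whole scrollable page, not just the viewport.
  await page.screenshot({ path, fullPage: true });
  return {
    screenshot: path,
    title: await page.title(), // a title such as "Access Denied" is a red flag
    url: page.url(),           // reveals unexpected redirects
  };
}

// Usage (inside an async function, after page.goto(...)):
//   console.log(await captureDiagnostics(page));
```

Returning the title and URL next to the screenshot path is a design choice here: it gives something to log or compare even when nobody opens the image.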




In this article's example, the server never even sent the real page. The initial response is an Access Denied page, and that is all you get from Chrome in headless mode. This never happens in “normal” mode.


An important part of diagnostics is determining what we know and what we don’t know. Without this step, it is impossible to stick to a plan of action that targets what we don’t know and includes only the necessary steps. This may sound elementary, but it is not easy to get there if you do not understand why it is needed. Diagnostics is sometimes treated as working through a checklist, but that only works if the error has been encountered before.


What do we know? We know that the browser made a single request and received a response saying that access is denied. The original page was not rendered, and the browser sent no other requests. This means that the server made its decision strictly based on what we sent in that first request, and that the blocking has nothing to do with the content of the page. This rules out everything that happens after the page renders and narrows the diagnostics down to the request alone. The request itself is just a set of bits and bytes sent over the Internet and received by the server.
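One way to confirm these observations rather than assume them is to record what a navigation actually did. The sketch below counts the requests the page issues and reads the status code of the main response. The function name summarizeNavigation is hypothetical; page is assumed to be a Puppeteer Page object.

```javascript
// Hypothetical helper: record what a single navigation actually did.
// `page` is assumed to be a Puppeteer Page object.
async function summarizeNavigation(page, url) {
  const requestedUrls = [];
  // Puppeteer emits a 'request' event for every request the page issues.
  page.on('request', (request) => requestedUrls.push(request.url()));

  const response = await page.goto(url);
  return {
    // page.goto can resolve to null, so guard before reading the status.
    status: response ? response.status() : null,
    requestCount: requestedUrls.length,
    requestedUrls,
  };
}

// Usage (inside an async function):
//   console.log(await summarizeNavigation(page, 'https://example.com/'));
```

A request count of one together with an error status would match the situation described above: the server decided on the first request alone.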


Comparing HTTP request headers


Since there is (or should be) little difference between Chrome running in headless mode and Chrome running in “normal” mode, it is reasonable to assume that the underlying network stack is the same and that both modes send requests identically at the packet level. This means we only need to focus on the contents of the request. To find the differences between a request made in headless mode and one made in normal mode, you can use an echo service, which returns your HTTP request back to you. The script below uses http://scooterlabs.com/echo.json to get a JSON representation of the request as the server received it.


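    const puppeteer = require('puppeteer');

    (async () => {
      // Headless by default; pass { headless: false } to compare with "normal" mode.
      const browser = await puppeteer.launch({});
      // Reuse the tab that Chrome opens on startup.
      const page = (await browser.pages())[0];
      // The echo service replies with a JSON description of the request it received.
      const response = await page.goto('http://scooterlabs.com/echo.json');
      console.log(await response.json());
      await browser.close();
    })();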

By running it both in headless mode (the default) and in “normal” mode (by adding headless: false to the launch options), you can compare the console output to find the differences, if any.
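Comparing two console dumps by eye is error-prone, so the comparison can also be scripted. The sketch below lists only the headers whose values differ between two runs. The function name diffHeaders is hypothetical, and it assumes each input is a flat object of header names and values; the exact shape of the echo service's JSON is not shown here, so pass in whichever part of the response holds the headers. The sample values are illustrative, not captured output.

```javascript
// Hypothetical helper: list headers whose values differ between two requests.
// Both arguments are assumed to be flat objects of header name -> value.
function diffHeaders(headless, headful) {
  // Header names are case-insensitive, so normalize before comparing.
  const normalize = (headers) =>
    Object.fromEntries(
      Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value])
    );
  const a = normalize(headless);
  const b = normalize(headful);
  const names = new Set([...Object.keys(a), ...Object.keys(b)]);

  const differences = {};
  for (const name of [...names].sort()) {
    // A header missing on one side shows up as undefined on that side.
    if (a[name] !== b[name]) {
      differences[name] = { headless: a[name], headful: b[name] };
    }
  }
  return differences;
}

// Illustrative sample values, not captured output.
const sampleHeadless = {
  Accept: '*/*',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36',
};
const sampleHeadful = {
  Accept: '*/*',
  'Accept-Language': 'en-US,en;q=0.9',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
};

console.log(diffHeaders(sampleHeadless, sampleHeadful));
```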




time_utc is the time at which we made the request. It varies between runs, but it is unlikely to be the cause of the blocking, unless the site blocks all requests at a specific time of day.


The Accept-Language header is missing in headless mode. This is a good signal that someone is using a non-standard browser (or browser mode), and the server could use the absence of this header to block us. It would be my first guess if there were not one more differing header: User-Agent.


The User-Agent difference stands out. It reveals an important detail, because this is the header through which headless mode gives itself away.




The header for human-driven Chrome is essentially the same once you remove “Headless”. User-Agent has long been a basic, naive way to block unwanted traffic. It is a good place to start to see whether changing it gets us what we need.


Blocking by User-Agent is considered a rudimentary and rarely used countermeasure because it is so easy to bypass. In fact, a site would benefit more from using the header to identify unwanted traffic rather than to block it, since some visibility into that traffic is better than none.


Solution (TL;DR)


Solving the blocking problem is as simple as changing the User-Agent header. It can be overridden at the page level using the page.setUserAgent() method. You can set it to the user agent of Chrome in “normal” mode, which, at the time of this writing, looks like this: “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36”.
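Hardcoding a version-specific string like the one above goes stale as Chrome updates. As an alternative sketch, the code below reads the browser's own user agent and removes the “Headless” marker from it. It assumes the headless user agent contains the token HeadlessChrome, in line with the comparison made earlier. The function names stripHeadless and useHeadfulUserAgent are hypothetical; browser and page are assumed to be Puppeteer Browser and Page objects.

```javascript
// Hypothetical helper: derive a normal-looking User-Agent from the headless one.
// Assumes the headless UA contains the token "HeadlessChrome".
function stripHeadless(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// `browser` and `page` are assumed to be Puppeteer Browser and Page objects.
async function useHeadfulUserAgent(browser, page) {
  const original = await browser.userAgent(); // the UA Chrome reports by default
  const patched = stripHeadless(original);
  await page.setUserAgent(patched); // overrides the header for this page
  return patched;
}

// Usage (inside an async function). Call it before page.goto(...) so that
// the very first request already carries the patched header:
//   await useHeadfulUserAgent(browser, page);
//   await page.goto('https://example.com/');
```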


That is all you need to do, which is why the diagnostic approach itself matters more than this particular solution. Obstacles of this kind come up all the time when automating sites, and specific answers are often not available on the Internet, so you will have to work through them yourself. Good luck, and feel free to contact me with any questions!
