Selenium with Chromium and Java on FreeBSD

10 Feb 2020 - tsp

Update: An implementation in Python has been added at the end of the article as a short draft on how to get started with Selenium in Python as well.

What is this about?

This blog entry is a short description of how to get started using Selenium with chromedriver on FreeBSD from a Java application. This can be used to develop automated tests for web applications, or simple bots that scrape content from webpages or automate actions on the web, using a full browser capable of running JavaScript, browser plugins, etc.

Note that this is just a short tutorial on how to set up your IDE and write a first simple program that accesses the webpage content and clicks a single link identified by an XPath expression. It’s not a complete introduction to Selenium or its Java interface. If one wants a detailed step-by-step tutorial on how to use Selenium to build tests for a web application, one can for example refer to Test Automation using Selenium WebDriver with Java: Step by Step Guide by Navneesh Garg (note: Amazon affiliate link; this page’s author profits from qualified purchases).

Install required software

First one needs a working Chromium installation. This is usually done via packages:

pkg install www/chromium

or via ports:

cd /usr/ports/www/chromium
make install clean

This automatically installs the chromedriver binary at /usr/local/bin/chromedriver.

Now one only needs to fetch the Selenium Java libraries. They can be found on the Selenium webpage. Just fetch the Selenium Java package (ZIP file) and save it at a convenient location. Unzipping the archive yields the client-combined-*.jar files (including a -source version) and a libs folder containing the dependencies.

Adding to the Classpath when using Eclipse IDE

When using the Eclipse IDE, simply start a new Java project, right-click your project and select Properties. Select Java Build Path and use the Add External JARs function to add both the client-combined-*.jar file (not the -source version) and all JARs from the libs folder to your project’s classpath. This takes effect during the build and also while launching from the Eclipse IDE.

[Screenshot: Adding Selenium JARs to the classpath in Eclipse]

When distributing your application you have to either use the method described in the next section, reference the libraries in your JAR’s manifest, install the JARs into a system-wide known location or (beware of licensing problems!) merge the JARs into a single one.

Adding to the Classpath on the CLI

In case you’re not running from your IDE you can configure your classpath either by setting the CLASSPATH environment variable in your shell’s init script or by using env CLASSPATH= on each command invocation (or while launching a subshell). This might be done in a wrapper script if desired. Do not forget to add the location of your own classes (JAR or directory tree) to the classpath as well.

For example, one might use the following invocation:

env CLASSPATH=.:$HOME/selenium/client-combined-3.141.59.jar:$HOME/selenium/byte-buddy-1.8.15.jar:... javac MyTestclass.java

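Note that one has to list each and every dependency from the libs folder in this case, so specifying them on the command line is rather inconvenient. Alternatively, a classpath entry ending in /* (quoted, so the shell does not expand it) makes Java pick up all JARs in that directory.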
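When a long classpath is assembled by hand it's easy to mistype a single entry. The following is a minimal sketch of a check that prints every entry the JVM actually received and whether it exists on disk. The class name ClasspathCheck and its entries helper are made up for this illustration and are not part of Selenium.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this article, not part of Selenium
class ClasspathCheck {
	// Split a classpath string at the platform separator (":" on FreeBSD)
	static List<String> entries(String classpath) {
		List<String> result = new ArrayList<>();
		for (String entry : classpath.split(File.pathSeparator)) {
			if (!entry.isEmpty()) {
				result.add(entry);
			}
		}
		return result;
	}

	public static void main(String[] args) {
		// java.class.path holds the classpath the running JVM was started with
		for (String entry : entries(System.getProperty("java.class.path"))) {
			String state = new File(entry).exists() ? "ok" : "MISSING";
			System.out.println(state + "\t" + entry);
		}
	}
}
```

Running it with the same CLASSPATH as the real program shows one line per entry. An entry reported as MISSING usually points to a typo or to a path the shell did not expand (the JVM itself takes a ~ literally).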

The first application

Now for a simple application that fetches the Slashdot webpage, accepts the cookie banner if present and fetches a list of stories together with their links.

First we create a source file named after our test program (in this example called TestProg) containing our basic skeleton.

Note that the style applied in this example is not suited for a real application. For example, one should almost never use catch Exception but implement proper exception handling instead.

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class TestProg {
	public static void main(String[] args) {
		try {
			// Set path of chromedriver binary
			System.setProperty("webdriver.chrome.driver", "/usr/local/bin/chromedriver");

			// Create the driver
			WebDriver driver = new ChromeDriver();

			// Let the user see the final state for 10 seconds
			Thread.sleep(10000);
			driver.quit();
		} catch(Exception e) {
			e.printStackTrace();
		}
		return;
	}
}

As one can see we have set the webdriver.chrome.driver system property, which has to point to our chromedriver binary. This binary has been installed automatically together with our www/chromium package. Setting the property inside the program is not exactly good style either - if at all possible it should be set by the external launcher script. Then we create the driver using new ChromeDriver(). This starts the browser instance that is remotely controlled via WebDriver. The startup is also reported on standard error:

Starting ChromeDriver 78.0.3904.108 (4b26898a39ee037623a72fcfb77279fce0e7d648-refs/branch-heads/3904@{#889}) on port 47736
Only local connections are allowed.
Please protect ports used by ChromeDriver and related test frameworks to prevent access by malicious code.
Feb 10, 2020 10:03:26 PM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
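Regarding the remark about setting the property from the launcher instead of hardcoding it: one possible approach is to accept a value passed on the command line and only fall back to the package default if none is given. This is just a sketch under that assumption. The class DriverPath and its methods are hypothetical names, and the fallback path is the one mentioned above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for this article, not part of Selenium
class DriverPath {
	// Location used by the www/chromium package as described above
	static final String DEFAULT_DRIVER = "/usr/local/bin/chromedriver";

	// Prefer a value supplied by the launcher, else use the default
	static String resolve() {
		return System.getProperty("webdriver.chrome.driver", DEFAULT_DRIVER);
	}

	// True if the path points to a regular file that we may execute
	static boolean isUsable(String path) {
		Path p = Paths.get(path);
		return Files.isRegularFile(p) && Files.isExecutable(p);
	}

	public static void main(String[] args) {
		String path = resolve();
		if (!isUsable(path)) {
			System.err.println("chromedriver not found or not executable: " + path);
			System.exit(1);
		}
		// Make the resolved value visible before new ChromeDriver() is called
		System.setProperty("webdriver.chrome.driver", path);
		System.out.println("Using chromedriver at " + path);
	}
}
```

The launcher script can then pass the location with the standard -D option of the java command, for example java -Dwebdriver.chrome.driver=/usr/local/bin/chromedriver TestProg, and the program fails early with a readable message if the binary is missing.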

Now - before we can fetch some data - we have to accept the cookie banner presented by Slashdot. To do that we first have to determine how we can locate the button. Luckily that’s easy on Slashdot - we use the Inspect feature of Chromium in an incognito tab (to start without any cookies or other session information present):

[Screenshot: Using the Inspect feature of Chromium]

Now we simply copy the XPath to the element:

[Screenshot: Copying the XPath with the Inspect feature of Chromium]
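To get a feeling for what such a copied expression selects, one can try it outside the browser. The sketch below uses the XPath implementation shipped with the JDK (javax.xml.xpath) on a tiny, made-up markup fragment. This is not the real Slashdot page, the second id is invented for contrast, and the class XPathDemo is a hypothetical name. An expression like //*[@id="cmpwelcomebtnyes"]/a reads as: any element with that id, then its direct a children.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, independent of Selenium
class XPathDemo {
	// Made-up, simplified markup; NOT the real Slashdot page
	static final String SAMPLE =
		"<body>"
		+ "<div id=\"cmpwelcomebtnyes\"><a href=\"#\">Accept</a></div>"
		+ "<div id=\"somethingelse\"><a href=\"#\">Decline</a></div>"
		+ "</body>";

	// Evaluate the expression and return the text of every matching node
	static List<String> selectTexts(String xml, String expression) {
		try {
			Document doc = DocumentBuilderFactory.newInstance()
				.newDocumentBuilder()
				.parse(new InputSource(new StringReader(xml)));
			NodeList nodes = (NodeList) XPathFactory.newInstance()
				.newXPath()
				.evaluate(expression, doc, XPathConstants.NODESET);
			List<String> texts = new ArrayList<>();
			for (int i = 0; i < nodes.getLength(); i++) {
				texts.add(nodes.item(i).getTextContent());
			}
			return texts;
		} catch (ParserConfigurationException | SAXException
				| IOException | XPathExpressionException e) {
			throw new IllegalStateException("Could not evaluate " + expression, e);
		}
	}

	public static void main(String[] args) {
		// Only the link below the element with the given id: prints [Accept]
		System.out.println(selectTexts(SAMPLE, "//*[@id=\"cmpwelcomebtnyes\"]/a"));
		// Every link in the document: prints [Accept, Decline]
		System.out.println(selectTexts(SAMPLE, "//a"));
	}
}
```

Note that the JDK parser requires well-formed XML, so this is only useful for experimenting with expressions, not for parsing real webpages.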

With the known XPath of the link to accept the conditions - in this case it’s luckily a link inside a uniquely identified element, so the path expression is unambiguous and simple ("//*[@id=\"cmpwelcomebtnyes\"]/a") - we can locate the required element using findElement with the By.xpath method and trigger a click() on it:

try {
	// Fetch a webpage. For this example we use Slashdot
	driver.get("https://slashdot.org/");

	// Locate the "Accept" button
	WebElement bannerElem = driver.findElement(By.xpath("//*[@id=\"cmpwelcomebtnyes\"]/a"));

	/*
		Click the element (if it's not present the NoSuchElementException
		would already have been thrown)
	*/
	bannerElem.click();

	// Display a message and provide some time for the user to see the action
	System.out.println("Clicked the cookie banner ...");
	Thread.sleep(250);
} catch(NoSuchElementException e) {
	System.out.println("Didn't have to click the cookie banner ...");
}

As one can see, the findElement function raises a NoSuchElementException in case the banner is not present. This already provides a (not so clean) way to detect the presence of the cookie banner.

Now to our main task - fetching the titles and links. For this we use the method findElements and supply a class name that we’ve also determined interactively using the Inspect feature of Chromium. This method returns a list of elements that are tagged with the given class name.

After that we can iterate through the elements, locate the link (a) element contained inside each story-title element, fetch the title (which is simply the text contained inside the link) as well as the href attribute, and print them on the command line:
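A somewhat cleaner alternative is to avoid the exception altogether: findElements (plural) returns an empty list when nothing matches, so presence can be checked without a catch block. The sketch below keeps this idea independent of Selenium so it runs on its own. PresenceCheck and first are hypothetical names, and the lookup is passed in as a Supplier that stands in for the real driver call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper for this article, not part of Selenium
class PresenceCheck {
	// Return the first element of the lookup result, or an empty Optional
	static <E> Optional<E> first(Supplier<List<E>> lookup) {
		List<E> found = lookup.get();
		return found.isEmpty() ? Optional.empty() : Optional.of(found.get(0));
	}

	public static void main(String[] args) {
		// With Selenium on the classpath the lookup would be something like
		//   () -> driver.findElements(By.xpath("//*[@id=\"cmpwelcomebtnyes\"]/a"))
		// and the action on the result would be WebElement::click.
		// Plain strings are used here as stand-ins for WebElement.
		Optional<String> banner = first(() -> Arrays.asList("accept-link"));
		Optional<String> none = first(() -> Collections.<String>emptyList());

		banner.ifPresent(e -> System.out.println("Clicked " + e));
		if (!none.isPresent()) {
			System.out.println("Didn't have to click the cookie banner ...");
		}
	}
}
```

The benefit of this design is that a missing banner becomes an ordinary value instead of control flow via exceptions, so a NoSuchElementException elsewhere in the program is not swallowed by accident.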

List<WebElement> titles = driver.findElements(By.className("story-title"));

for(WebElement elem : titles) {
	WebElement titleLink = elem.findElement(By.tagName("a"));

	String strTitle = titleLink.getText();
	String strHref = titleLink.getAttribute("href");

	System.out.println(strTitle + " + " + strHref);
}
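The " + " separator used above is fine for a demo but awkward to parse if a title itself contains a plus sign. If the output is meant to be processed by other tools, one option is to emit one tab-separated line per story. This is only a sketch: StoryLine and format are hypothetical names and the sample values are invented.

```java
// Hypothetical helper for this article, not part of Selenium
class StoryLine {
	// Build one tab-separated line from a title and its link
	static String format(String title, String href) {
		// Collapse tabs, newlines and repeated spaces so a story stays on one line
		String cleanTitle = title.replaceAll("\\s+", " ").trim();
		return cleanTitle + "\t" + href.trim();
	}

	public static void main(String[] args) {
		// Invented sample values standing in for getText() and getAttribute("href")
		System.out.println(format("  Ask Slashdot:\n What now? ", "https://example.org/story"));
	}
}
```

Inside the loop one would then call System.out.println(StoryLine.format(strTitle, strHref)) instead of concatenating the strings directly.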

Now we’ve created a complete scraper for Slashdot headlines and their links.

A word of caution (when writing bots instead of tests)

If you intend to use Selenium to create a bot, beware that there are bot detection scripts that scan for modifications made by Selenium to the browser (injected JavaScript, added properties inside the DOM, etc.). There are ways to prevent this injection and its detection by anti-bot scripts, but as soon as you’re blacklisted you might have trouble getting unlisted, depending on the service. Remember that Selenium was basically created for testing webpages by supplying the input a real user would supply. You’ll encounter such Selenium detection scripts when accessing webpages like your bank’s online presence, payment portals and big merchant portals. Be sure to check whether they block your account before using your main credentials (at least try some test credentials first so that your main account doesn’t get banned, or keep an additional set of accounts that you also use on a day-to-day basis). Also beware that using automated bots might violate the terms of service, so web services may have the right to block your accounts and deny any further business with you.

In any case - please don’t write a spambot. There’s already enough spam on the web and no one likes it. There are of course many valid reasons to write bots that scrape information from webpages which make direct fetching and processing hard because they build their content using JavaScript without any fallback to plain HTML - that’s the worst web design practice in my opinion (and normally I simply do not use such pages any more).

Full sourcecode of sample application

The full source is available as a GitHub Gist.

Update: How to do the same thing in Python

Because I’ve been asked by a student how to achieve the same effect with Python: that’s pretty easy. Again, one first requires the www/chromium package as well as the Selenium Python libraries (installed via pip install selenium).

Now one can use the webdriver module from the selenium package:

from selenium import webdriver
from time import sleep

driver = webdriver.Chrome()
driver.get("https://slashdot.org/")

Access to elements works similarly to Java, using functions like find_element and find_elements together with a By locator (By.XPATH, By.CLASS_NAME, By.TAG_NAME) and so on. Accessing attributes uses get_attribute, and the text contained inside an element is available via the text property.

One can assemble this into a short program, which is hosted as a GitHub Gist.

This article is tagged: Internet, Web, Programming, FreeBSD, Java, Python, Data Mining, Testing


Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)

