Statistics on chat service online status

03 May 2022 - tsp
Last update 03 May 2022
Reading time 21 mins

Gathering the data
The Node-Red endpoint
Visualizing using Grafana
Intermediate result
- First test over a few hours
- Running for the first week
Taking a deeper look after a month

So this is a project that I wanted to do for a long time since I have some previous experience with social network analysis and web scraping and I wanted to “see” how easy and how privacy invading it is when one just hacks this in the fastest and most dirty version (not with sophisticated data management as I’ve done before for different applications) - my initial thoughts on this have been that it’s way too easy (I basically oppose supporting online status on chat systems where you invite your whole telephone book or everyone you know and totally oppose read receipts or mail delivery notifications) and provides way too much insight that one would not expect. And I was not disappointed - only somewhat surprised that it was even easier than I thought - writing this blog post took way longer than implementing scraping and basic analysis and reading it will also take longer for sure.

Usually the reaction when talking about this problem is just a comment like one doesn’t have anything to hide, one cannot infer much from a simple online status or that it requires extensive skills to perform such analysis (or even that platforms do provide protective measures against such data gathering - which is unfortunately not possible - as soon as someone can see data one can automatically scrape and process it, a clean API just makes the job minimally easier and less frustrating when one’s doing some honest stuff but really no one will ever care when doing malicious stuff to do the minimal extra work required to circumvent any protection)

First of a short but important disclaimer: The people who had been monitored with these tools had been notified and asked for permission. I scraped the whole service that I used but immediately applied a whitelist and discarded data of anyone who didn’t consent. But beware that anyone who has dishonest or stalking intentions can do this without asking and without filtering.

Gathering the data

The first question is how to gather data from one of the large closed source chat services (I decided I didn’t want to monitor stuff like my XMPP network but something that masses are using and thinking they’re protected by it being a proprietary island). I choose a service where I had a web based interface in addition to the mobile application since this makes life way more easy. This is the case for most mass popular messaging solutions up to day anyways.

The first idea was to inspect HTTP(S) transactions during the usage of the service to figure out how notifications worked and replay this monitoring from own custom scripts. This would be the ideal case but the service I used had some measures in place where scripts and tokens had been changing on a regular basis - which by the way is the biggest stopper for third party clients or transports to open chat networks that would provide a huge gain in usability of such services - so long term monitoring was not possible without reimplementing the whole browser transactions and login also used some more complicated client features. So why not reuse the client? The first idea was to use Selenium and host the whole browser session inside the scraping application - but since I wanted to use the Chromium browser this turned out to be more challenging since the client side scripts tried to detect a page running inside an automated session and since Selenium is not a hacking tool it happily exposes it’s presence. Since I didn’t really want to spend more than a few minutes on extracting the data the decision was clear: Just use the browser and access the page content using a content script form a fast hacked Chromium extension to access the data readily available inside the DOM of the page. This turned out to be rater static and reliable and allowed to use the default login flow to prepare the page. The sessions also never timed out when keeping the messaging webpage open due to regular transfer happening which made this approach stable enough to perform data gathering.

So the basic idea was:

Enable developer mode at chrome://extensions.
Create a directory that was used to host the extension that’s usually packed into a ZIP file and named CRX and is then usually also signed. But since I didn’t want to distribute the script and this should only be a quick hack and no properly engineered solution anyways …

The first file I created was a Manifest in manifest.json:

{
	"name" : "Status scraper",
	"description" : "Playing with statistical analysis",
	"version" : "0.1",
	"manifest_version" : 3,

	"action" : {
		"default_popup" : "popup.html"
	},

	"background" : {
		"service_worker" : "background.js"
	},

	"permissions" : [ "activeTab", "scripting" ],
	"host_permissions": [
			"http://www.example.com/*"
	]
}

As one can see the manifest declares some basic information about the browser extension and then:

Declares a popup.html that will be displayed in the browser toolbar as soon as the extension is ready. I use this popup to start scraping on demand in a non automated fashion since I also wanted to use other tabs on the same domain that should not be scraped. The popup.html also references a popup.js that will then be included inside the webpage. This script file also includes the content script function.
A service worker in background.js that will accept messages from the content script that include the presence information that will be forwarded to a Node-Red instance that will shove this information into a simple SQL database.
The permissions include:
- activeTab which is required to get a reference to the foreground tab when launching the scraper function
- scrpting which is required to inject content scripts into the foreground tab
- A host_permissions entry that tells Chrome which endpoints the background script is allowed to contact. This is the address of the Node-Red instance (one could even list the endpoint itself)

My popup.html is pretty basic since I didn’t care about it being pretty or expressive - it was sufficient to provide a launch method for injecting the scraping script.

<!DOCTYPE html>
<html>
	<head>
	</head>
	<body>
		<p> <button id="scraperStart">Start</button> </p>
		<script src="popup.js"></script>
	</body>
</html>

The popup.js script that also includes the handler for the button with id scraperStart is the main workhorse on the scraping side. In case one wants to start the script automatically and inject a content script without human interaction a nice way is simply using the manifest and putting the content function that’s currently contained in popup.js in contentscript.js:

{
	// ...
		"content_scripts": [
			{
				"matches": ["https://www.example.com/chatsession/*"],
				"js": ["contentscript.js"],
				"run_at" : "document_idle"
			}
		],
	// ...
}

Before I could implement this script I had to determine what to scrape. So I searched a way inside the messaging application to display only online users. Luckily this existed (in three different ways). Then I used the inspection feature of Chrome to locate the wrapping element and used the Copy / XPath feature inside the inspection utility to determine the XPath for the element. Even though the page layout was pretty complex due to the framework that had been used there has been an simple list (li) element that wrapped one entry after each other that hosted a single link (a) that I used to extract a unique user ID as well as a nested span element that hosts a plain text human readable screenname of the user.

I won’t put the exact XPath for the page below but substitute it with two example values below that I already substituted the running index (starting from 1) with the variable i:

"//*/div[1]/ul/li["+i+"]/div/a/div/span" for the user name
"//*/div[1]/ul/li["+i+"]/div/a" for the link that I wanted to use to extract the user ID from.

The basic idea was to locate the two element using document.evaluate, check if they really exist and if extract the inner text from the user name element as well as the user ID from the splitted link target. If everything turns out to work I simply append the ID and users screen name to a list of seen users. After iteration the whole structure will be passed to the backend script using chrome.runtime.sendMessage. The whole scraping function is then executed every 15 seconds so it records which users are seen online every 15 seconds. This also allows some kind of monitoring due to periodic heartbeat.

scraperStart.addEventListener("click", async () => {
	let [tab] = await chrome.tabs.query({ active: true, currentWindow: true });

	chrome.scripting.executeScript({
		target : { tabId : tab.id },
		function : runContactScraper
	});
});

/*
	In case one wants to start the script automatically via the content-script
	instead of the action mechanism one simply only uses the content of runContactScraper
	inside the content.js script
*/
function runContactScraper() {
	window.setInterval(() => {
		let i = 1;
		let tsTimestamp = Date.now();

		let activeData = {
			"ts" : tsTimestamp,
			"users" : [ ]
		};

		for(;;) {
			let pathName = "//*/div[1]/ul/li["+i+"]/div/a/div/span";
			let pathLink = "//*/div[1]/ul/li["+i+"]/div/a";

			let elementName = document.evaluate(
				pathName,
				document,
				null,
				XPathResult.FIRST_ORDERED_NODE_TYPE,
				null
			).singleNodeValue;
			let elementLink = document.evaluate(
				pathLink,
				document,
				null,
				XPathResult.FIRST_ORDERED_NODE_TYPE,
				null
			).singleNodeValue;

			if((elementName != null) && (elementLink != null)) {
				let contactName = elementName.innerText;

				if(contactName == '') {
					activeData = false;
					break;
				}

				let contactLink = elementLink.getAttribute("href");
				let contactId = (contactLink.split("/"))[1];

				activeData.users.push({
					"id" : contactId,
					"screenName" : contactName
				});
			} else {
				break;
			}

			i = i + 1;
		}

		chrome.runtime.sendMessage({
			"message" : "activeData",
			"payload" : activeData
		}, response => { console.log(response); });
	}, 15000);
}

Note that there is no way the page can determine that this script is running and scraping their DOM. It’s running in a separated and isolated scripting environment - the only resource it shares with the page itself is the DOM of the page. So there will never be protection against this kind of scraping except when one rebuilds the page layout randomly - but even then it’s not that hard to locate the information one wants to scrape anyways. Please don’t try to evade scraping - people build really useful tools and add value to your webservices.

The background.js script then only has to accept this JSON and pass it to the fetch API:

chrome.runtime.onMessage.addListener(function (request, sender, sendResponse) {
	fetch("http://www.example.com/noderedendpoint", {
		method: 'post',
		headers: {
				"Content-type": "application/json;charset=UTF-8"
		},
		body: JSON.stringify(request.payload)
	}).then(function (data) {
			console.log('Request succeeded with JSON response', data);
	}).catch(function (error) {
			console.log('Request failed', error);
	});
	sendResponse(request.payload);
	return true;
});

Unpacked extension loaded

After finishing the scripts I simply loaded it into chromes extension space by using the load unpacked extension feature on chrome://extensions. There one also is able to access the error console for the background script as well as the error messages when processing the manifest. It’s also there where one reloads the extension on change.

The Node-Red endpoint

Node-RED flow

The next part in the processing chain had been realized using Node-Red. Usually I won’t recommend using Node-Red for any production stuff but it’s just a quick hack and the setup has already been there - and it’s nice to play around. So I simply added a HTTP in node on a dashboard and configured it for POST requests and an arbitrary chosen URI (/dataana/examplestatus). the payload is then always deserialized by a JSON node into a JavaScript object. Since I wanted to write into a MySQL database I added - at the end - a mysql node and configured database, username and password. The SQL queries will then be pushed via the topic field of the messages, the payload contains the bound parameters for the statements.

Then I prepared the database:

USE exampledb;
CREATE TABLE presenceAnalysisUserNames (
	userid BIGINT UNSIGNED NOT NULL,
	screenname VARCHAR(256) NOT NULL,

	CONSTRAINT pk_presenceAnalysisUserNames_id PRIMARY KEY (userid)
);
CREATE TABLE presenceAnalysisSeen (
	userid BIGINT UNSIGNED NOT NULL,
	ts BIGINT UNSIGNED NOT NULL,

	CONSTRAINT pk_presenceAnalysisSeen PRIMARY KEY (userid, ts),
	CONSTRAINT fk_presenceAnalysisSeen_userid FOREIGN KEY (userid) REFERENCES presenceAnalysisUserNames (userid) ON DELETE CASCADE ON UPDATE CASCADE
);

CREATE INDEX presenceAnalysisSeenIndexTS ON presenceAnalysisSeen (ts);

GRANT SELECT ON exampledb.* TO 'grafana'@'localhost';
GRANT SELECT,INSERT,UPDATE ON exampledb.* TO 'nodered'@'localhost';

Now I used a simple JavaScript function to transform the incoming payload into a sequence of SQL insert statements that are then passed in sequence to the MySQL node.

let msgs = [];

let ts = msg.payload.ts;
msg.payload.users.forEach(element => {
    msgs.push({
        "topic" : "INSERT INTO presenceAnalysisUserNames (userid, screenname) VALUES (:uid, :scrname) ON DUPLICATE KEY UPDATE userid = userid",
	    "payload" : {
		    "uid" : parseInt(element.id),
    		"scrname" : element.screenName
	    }
    });

    msgs.push({
        "topic" : "INSERT INTO presenceAnalysisSeen (userid, ts) VALUES (:uid, :ts) ON DUPLICATE KEY UPDATE userid = userid",
	    "payload" : {
		    "uid" : parseInt(element.id),
    		"ts" : msg.payload.ts / 1000
	    }
    });

});

return [ msgs ];

Visualizing using Grafana

Now that the data is available let’s first do some basic visualizations:

A 5 minute binned activity graph for every seen user
A 5 minute binned activity graph for every seen user stacked
A 30 minute binned activity graph for every seen user
A simple series view that just shows how active all seen users have been during this time in comparison

The basic query that I’m using is just a basic select on the presenceAnalysisSeen table that uses integer division (realized by SQLs round) to do basic binning of the timestamp values (this is also what would be done by Grafanas $__timeGroup macro), groups by this bins and the screen name that’s fetched via a simple INNER JOIN on the user id and is then filtered by the current selected time range in the Grafana dashboard using the $__unixEpochFilter macro. The measure for activity is simply the number of occurrences of each user inside the bin, the bin size is determined by the number of seconds divided and multiplied by. For example for a 5 minute bin size 300 seconds:

SELECT
  ROUND(ts / 300, 0) * 300 AS "time",
  screenname AS metric,
  COUNT(ts) AS activity
  FROM presenceAnalysisSeen
  INNER JOIN presenceAnalysisUserNames ON presenceAnalysisUserNames.userid = presenceAnalysisSeen.userid
  WHERE $__unixEpochFilter(ts)
  GROUP BY time, screenname
  ORDER BY time;

Unfortunately I did not figure out how to introduce NULL values for times when people are not present with this simple query and graph setup.

Intermediate results

As it turns out even the simple graphs generated show pretty much insight into the daily behavior of people and allow one to separate different groups of people.

First test over a few hours

As a first test I checked on the first hours of gathered data. First a summary of the stacked 5 minute binned activity of the test group:

Test group activity

As one can see this whole group shared some common group behavior - they had been much more inactive before around 6 PM - this is due to working behavior most likely. Then one can see a drop in activity before news and prime TV hours started with a short increase in activity during advertising between those two TV blocks. Note that this is collective behavior. Individual (non stacked displayed) behavior is much more individual:

Individual test group activity

If one looks at individual behavior one can see some people just checked in for about half an hour:

Individual test group activity

While other people had been active over a longer period of time:

Individual test group activity

The series plot also contains immediate information about the activity of individuals on the services webpage or mobile app:

Individual test group activity

Running for the first week

The next time I checked back was when the script ran for nearly a full week. The first thing that one immediately sees for collective behavior is the daily pattern. This worked best in stacked 1 hour binned view:

What I found most interesting on the collective patterns is:

Most people seem to stay up till somewhere between 11 PM and midnight
Most people also check back in at 8 AM again
Activity is high on workdays over the whole course of the day increasing radically at 5 PM
Workday online activity seems to be increasing from Monday to Friday massively which reflects the known productivity decline over the 5 day workweek. As one can see Friday shows way larger activity over the entire day than the other four workdays.
On weekends as expected activity started at 10 AM to 11 AM a little bit later but also there was a huge drop in activity on afternoon (i.e. where most people seem to follow their hobbies or the activities they like - thus reducing online presence)

Then I did take a look into distribution of activity levels:

Activity of different people over the whole week

As one can see a single individual turned out to be way more active than anyone else (after talking back this had been someone writing up a PhD thesis). But even this person had the typical activity pattern that one sees for more active people so it was not a client just left running 24/7, it really was the services usage pattern.

On the other hand I found one person who (also asked afterwards and got a confirmation for this theory) used a mobile application version of this communication solution. At any point in time the device the phone went out of standby the application indicated available presence. This exposed - in addition to daily usage pattern - the daily charging pattern of the mobile phone. So one could assume that this is the usual time of being home and most likely being asleep.

Activity of different people over the whole week

Taking a deeper look after a month

After gathering data for a few weeks I then decided to take a deeper look. The basic ideas I wanted to tackle:

Check how usage of social media compares:
- Between different days of week
- At typical work hours vs non work hours
If one can assign people to using social media at work or not
If one can segment people into different usage type clusters depending
- On their daily usage compared to average daily usage
  - On their usage frequencies

Since I had changed some ways I gather data I had to limit myself first to a time span using the same gathering method to compensate for those effects to not have to take that into account.

Activity tracked over a single month

To get a feeling if there is a huge difference in how often people use the given service I first simply counted the times they’ve been seen online and counted events. As one can see there are of course already some people who are way more active than others. This can already be used to define a normal range by calculating the normal 5 point summary and thus segment due to Quartils:

Activity over a single month colored by quartil

Now lets get to more interesting stuff. Let’s look at the average usage time of day by segmenting the day into quarter of hours and collecting counts there:

Activity on days averaged

To compensate for extremes one can simply only use the center Quartils and discard the users who are really extensively or rarely using the service:

Excluded extreme and rare users

The next idea was of course to see how good people fit that average activity by normalizing the activity levels and comparing them to the average one to get a score how average people are (or in other words how well they adhere to majorities common behavior). Note that this is of course already biased by excluding extreme users from the baseline before - it somehow marginalizes the patterns by around 50% of all users in this case but that’s usually not much of a problem since the bandwidth of normal behavior is usually pretty large.

How much people deviate from normal behavior

Checking against again the real life behavior of those people that had been marked red in the plot above most of them had some pretty obvious deviations from the behavior of the remaining group (being dead and just have a lingering profile, being retired whereas the majority was not, being in a crisis and unemployed with strange sleeping habits, having huge health problems, etc.). The people marked green are usually within one standard deviation of the original baseline data, the people marked blue seem to be over-compliant (or represent the seemingly ideal average behavior).

The next step was to do some correlation analysis. Since the chat service supports real time communication it’s likely that people who communicate primarily with each other will be more likely be active at the same (or similar) time. This will be done by looking at correlations between peoples activity times:

Correlation matrix

As one can see the used dataset is not really suitable for doing such an analysis since it mainly contained a single social group - as one can see from the structure. The most interesting features of this matrix are to be found in the less populated regions. There some random symmetric local maxima really indicate people who communicate most likely with each other - couples who don’t use the service often but when they use it they use it with each other for example, …

The last thing I wanted to try with this dataset is now anomaly detection - calculating and individual baseline behavior and span for individual people inside a sliding window and check when they deviate more than usual from their usual behavior. The sliding window has been chosen to be about one week so this should show results even when people went to holidays, etc. Hopefully I find some time (and enough more collected data) the next few weeks to add that …