05: Automating Chrome with AutoHotkey & getting data from page (part 1)

When GeekDude and I recorded this topic, it ran over an hour, so I broke it into two parts.  This session covers how to get content from a page.

The great news is that GeekDude explained how we can see the Reddit site the way it looked in the video below!

By logging into https://old.reddit.com/r/AutoHotkey/, the HTML will be the same as in the video!

Automating Chrome with AutoHotkey & getting data from page


If you’re loving this, please consider donating to GeekDude!
Notes from Automating Chrome with AutoHotkey & getting data from page

Scraping data with Chrome and AutoHotkey

00:30     Let’s scrape from the AutoHotkey subreddit  https://www.reddit.com/r/AutoHotkey/

00:42     Maybe you want to get all of the links to the comment section.  Or maybe the titles for all of them for text analysis.

01:22     We’ll start off getting just one thing (instead of multiple).  Let’s look at the sidebar.

01:35     Inspect the sidebar and see what the HTML looks like.  It is made up of blockquote sections.  (When I look at the HTML on the page, I see a very different structure than GeekDude had, so I’m not going to be able to replicate the AutoHotkey code.)

01:58     Copy the JS path to just one, let’s see what we get.

02:18     If you look closely you’ll see #form-t5_….  This looks like it is automatically (dynamically) generated.

02:45     Reloading the page you see the form info has changed so our strategy isn’t going to work that way.

03:14     Try just using div > div blockquote:nth-child(1).  Yes, that worked for now, but it may not always get the right one.  We’re going to look at a more general approach.

03:53     Let’s say we want everything in that blockquote.  Right above it there is an element with the class md, so let’s try that.

04:05     Run document.querySelectorAll(".md") and we’ll see how many elements it returns.

04:25     Why dot md (".md")?  Because it is a CSS selector.  The dot "." is a shortcut for class.  Here are a few examples:  # = id, . = class, [] = attribute, * = all elements.  This CSS resource page is great for showing a ton of different ways to find things on a page.  You can practice on this page.

I also shared these two resources from Michael Sorens which present the same data grouped by Method and grouped by Tool.
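To make the selector shortcuts above concrete, here is a minimal sketch that keeps a few selector strings in a table and counts how many elements each one matches. The names selectors and countMatches are my own, and "#header" is a hypothetical id used only for illustration; ".md" and name=thing_id come from these notes.

```javascript
// A small cheat sheet of CSS selector strings.
const selectors = {
  byId: "#header",                   // # = id (hypothetical id, for illustration)
  byClass: ".md",                    // . = class
  byAttribute: '[name="thing_id"]',  // [] = attribute
  everything: "*",                   // * = all elements
};

// Count the matches for every selector in the table.
// `root` can be `document` or any object that has querySelectorAll.
function countMatches(root, table) {
  const counts = {};
  for (const [label, selector] of Object.entries(table)) {
    counts[label] = root.querySelectorAll(selector).length;
  }
  return counts;
}

// Only runs where a page exists, e.g. when pasted into the DevTools console.
if (typeof document !== "undefined") {
  console.table(countMatches(document, selectors));
}
```

Pasted into the DevTools console on a page, this should print one count per selector, which is a quick way to see whether a selector is too broad before building on it.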

05:01     So ".md" will select all elements with the class md on the page.  But sometimes you might want to select by something else.  This is where the iWB2 Learner tool makes it easy to see the parent structure.

05:26     There’s another element with a name of "thing_id".  If we look at the CSS selectors, we have an option to select things with a given attribute:  document.querySelectorAll("[name=thing_id]").

06:00     We can see that it gets 1 item, and that’s the hidden input tag.  Or perhaps use the ID.  It’s the dynamically generated bit that can throw us off.

06:58     We’re going to start with that code to get .md and then let’s just get the live chat section.  So the 3rd blockquote element.

07:20     There are two ways you can chain things together.  In the CSS selector you can add a space and then put another selector.  Now it will find any blockquote that exists in the tree below that .md.  That gets us our list of blockquotes.

08:14     Or you can run another query on the element you got back (because of dot notation).

document.querySelector("[name=thing_id]").querySelectorAll("blockquote")

or we could have done

document.querySelectorAll("[name=thing_id]")[0].querySelectorAll("blockquote")
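Here is a rough sketch of the two chaining styles side by side, using the .md container from earlier in these notes. The function names viaOneSelector and viaTwoQueries are made up for illustration; both take a root (normally document) so they can be tried on any page.

```javascript
// Way 1: a single selector. The space means "anywhere in the tree below".
function viaOneSelector(root) {
  return Array.from(root.querySelectorAll(".md blockquote"));
}

// Way 2: query for the containers, then run a second query inside each one.
function viaTwoQueries(root) {
  const found = [];
  for (const container of Array.from(root.querySelectorAll(".md"))) {
    for (const quote of Array.from(container.querySelectorAll("blockquote"))) {
      // Skip duplicates, which can appear if one .md is nested inside another.
      if (!found.includes(quote)) found.push(quote);
    }
  }
  return found;
}

// Only runs where a page exists, e.g. when pasted into the DevTools console.
if (typeof document !== "undefined") {
  console.log(viaOneSelector(document), viaTwoQueries(document));
}
```

Both calls should list the same elements. The single selector is shorter, while the second style is handy when you already hold an element and want to search only inside it.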

09:09     The “chain” in selectors can be very helpful selecting what you want

09:26     What if some of the blocks aren’t visible for some people (so the 3rd blockquote may be a different one)?  Would you look for an attribute such as a title instead?

10:25     You’d probably loop over the list of items searching for “documentation”.

10:45     While CSS has a lot of ways to select elements, it can’t look at the text contained inside it. So you’ll want to start with the collection then filter it further.

11:43     Don’t forget JavaScript lists are zero indexed.  The first item is 0, the second is 1, the third is 2.

12:20     If you want to grab the item that says "documentation" you’re going to need more code than that.  The simplest way is to set a variable to your collection:

var x = document.querySelectorAll(".md blockquote");

Then you can iterate over it with a loop

13:27     The old JavaScript way that will work in most situations is to write it like this:

for (var i = 0; i < x.length; i++) {
    console.log(x[i]);
}
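As a small sketch of the "loop and check each item" idea, the snippet below searches a collection for the first item whose text starts with a given word. So that it runs anywhere, x is a stand-in object shaped like the result of querySelectorAll (numeric keys plus a length); the texts and the helper name indexOfText are invented for illustration.

```javascript
// Stand-in for document.querySelectorAll(".md blockquote").
// The texts below are made up.
var x = {
  0: { innerText: "Rules: be kind" },
  1: { innerText: "Documentation: start here" },
  2: { innerText: "Live Chat: join us" },
  length: 3,
};

// Return the index of the first item whose text starts with `prefix`,
// or -1 if nothing matches.
function indexOfText(list, prefix) {
  for (var i = 0; i < list.length; i++) {
    if (list[i].innerText.startsWith(prefix)) {
      return i;
    }
  }
  return -1;
}

console.log(indexOfText(x, "Documentation")); // prints 1: the second item, since counting starts at 0
```

On a real page you would replace the stand-in with the actual querySelectorAll call.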

13:56     That goes through each item in the list and you could check on them.  But there is a newer way in JavaScript, which will work in most situations and lets you skip saving a variable:  Array.prototype.filter()

Array.prototype.filter.call(document.querySelectorAll(".md blockquote"), (e) => console.log(e))

15:25     Or do it more simply (filter can be called from Array.prototype or from an actual array):

[].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Documentation"))

16:30     Make sure you pay attention to capitalization.  JavaScript is case sensitive!

For the line above:

  • [] = some array (an empty one, used only to borrow its filter method)
  • [].filter.call(document.querySelectorAll(".md blockquote"), …) = runs filter against the list, which you have narrowed down as specifically as you could
  • (e) => e.innerText.startsWith("Documentation") = e will be the variable for the element you’re comparing
  • startsWith("Documentation") = the comparison / search you’re running on each element

18:15    Since we can see it highlighted, we know we’re on the right one.

18:20    Filter returns a list of matching items (it is still an array) so you’ll want to get your first returned item.

[].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Documentation"))[0]
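To see the filter trick without a browser, here is a minimal sketch that runs it against a stand-in for the list querySelectorAll returns. The object fakeNodeList and its texts are invented for illustration; the filter call itself has the same shape as the line above.

```javascript
// Array-like stand-in for a NodeList: numeric keys plus a length.
// The texts are made up.
const fakeNodeList = {
  0: { innerText: "Documentation and tutorials" },
  1: { innerText: "Live Chat on Discord" },
  2: { innerText: "Documentation for v2" },
  length: 3,
};

// Borrow filter from an empty array and run it against the array-like object.
const matches = [].filter.call(fakeNodeList, (e) =>
  e.innerText.startsWith("Documentation")
);

console.log(Array.isArray(matches)); // true: filter returns a real array
console.log(matches.length);         // 2
console.log(matches[0].innerText);   // Documentation and tutorials
```

Because the result is always an array, even a single match still needs the [0] on the end before you can use the element itself.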

19:04     Could you have returned the array and shoved it into an AutoHotkey object?  Then looped over it?

19:20     Maybe you want to grab it and do some processing in AutoHotkey.  It is possible with some of the Chrome Dev tools API endpoints.  It gets really difficult, really quickly.

19:40     For that reason the recommendation is your JavaScript should output the data you want.  With IE it is easy to pull into AutoHotkey but with Chrome it is so much easier to get what you want in JavaScript and then pass that data on.
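One way to follow that recommendation is to have the JavaScript build a single string that already contains everything you need. The sketch below turns a list of links into JSON text. The function name linksToJson and the sample links are hypothetical, and how the string is picked up on the AutoHotkey side is not shown here.

```javascript
// Turn a list of link-like objects into one JSON string.
// `anchors` can be an array or the result of querySelectorAll("a").
function linksToJson(anchors) {
  const rows = Array.prototype.map.call(anchors, (a) => ({
    text: a.innerText.trim(), // visible link text without surrounding spaces
    href: a.href,             // link target
  }));
  return JSON.stringify(rows);
}

// Stand-in for real anchor elements; the values are invented.
const fakeAnchors = [
  { innerText: " Live Chat ", href: "https://example.com/chat" },
  { innerText: "Docs", href: "https://example.com/docs" },
];

console.log(linksToJson(fakeAnchors));
```

The design choice is to do the looping and filtering in JavaScript and hand over plain text, rather than trying to walk a JavaScript object from AutoHotkey.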

20:20     Should you always terminate your JavaScript commands with a semicolon?  If it’s working without a semicolon, then don’t add one…

21:20     The example GeekDude was giving was two parameters to filter.call

[].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Documentation"))[0]

21:48     It looks odd because some of the techniques are not available in AutoHotkey.

22:11     Let’s say we wanted to get “Live Chat” instead.

[].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Live Chat"))[0]

22:31     Now let’s break out the first item and do something with it.  Let’s look at its HTML using the innerHTML property on the blockquote.

23:12     But maybe what you really wanted is the link in that section.  This is where chaining queries is the way to go.

[].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Live Chat"))[0].querySelectorAll("a")[1]

24:13     Now we’ll get the second one (remember JavaScript is zero indexed)

24:49     If you really want to get good at web scraping, read an intro book on building websites.  Studying the DOM is critical to being able to web scrape.  Everything else is much simpler once you understand it.

25:28     Now that we have the JavaScript code for getting what we want, we can put it in our AutoHotkey code.

25:34     Just setting it as a text assignment is much simpler

js = [].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Live Chat"))[0].querySelectorAll("a")[1]

msgbox % page.Evaluate(js)

26:12     Make sure you’re connecting to the page correctly…

26:40     Now we get an empty box (but no error).  AutoHotkey is not great at peeking inside an object, and Page.Evaluate returns an object.

27:28     In the current version of Chrome.ahk there’s a function you can call that takes an object and turns it into text so you can see inside it.  This was built by Coco (AutoHotkey-JSON) but is included in Chrome.ahk.

js = [].filter.call(document.querySelectorAll(".md blockquote"), (e) => e.innerText.startsWith("Live Chat"))[0].querySelectorAll("a")[1]

msgbox % Chrome.Jxon_Dump(page.Evaluate(js))

28:27     Here we’ll see it is an object with two keys:  type and value.  We want the value.

msgbox % page.Evaluate(js).value

29:50     Now that we have what we want, we can put it into a variable to use it later, run it, etc.

30:22     Page.Evaluate is returning an Object with a type and a value.  Every once in a while you might want type.

 

 

 

 

 
