06: Chrome and AutoHotkey: Getting lists from page

Here we continue with GeekDude, working with Chrome and AutoHotkey to extract data from a webpage.  This session focuses on getting lists and leverages JSON, Chrome.Jxon_Dump, JSON.stringify, Chrome.Jxon_Load and jQuery.

The great news is that GeekDude explained how we can see the Reddit site the way it was in the video below!

If you log into https://old.reddit.com/r/AutoHotkey/, the HTML will be the same as in the video!

Chrome and AutoHotkey: Getting lists from page

If you’re loving this, please consider donating to GeekDude!

Notes: Chrome and AutoHotkey: Getting lists from page

00:05     Sometimes you want multiple items from a page.  Maybe all post titles, or all links from comments.

00:20     Looking in HTML we see there’s a lot going on.  We need to look at the structure BEFORE we get working on it.

01:04     Classes are a great way to get a hold of something when you’re automating.  The IDs seem to be dynamically created so not helpful for us.  Let’s look at elements that are “things”.

document.querySelectorAll(".thing")

02:10     We get a lot of items (more than just posts as the 1st post is item # 4).

02:14     Let’s take another look.  They all have the thing class on them.  They all have unique IDs (which we don’t want to use).  Their classes alternate between even and odd.

02:41     Let’s try .link

document.querySelectorAll(".link")

03:02     The top looks good, the bottom looks pretty good (just posts).  Sometimes you want to match on both classes.  To do that, you append them together with no space between them:

document.querySelectorAll(".thing.link")

03:43     If you put a space in the middle, it assumes hierarchy and looks for a link that sits anywhere inside a thing.

document.querySelectorAll(".thing .link")

04:04     The greater-than sign > requires the second item to be a direct child.  A space lets it be anywhere among the descendants.

document.querySelectorAll(".thing > .link")

04:37     For automation, the > and space combinators, . for class, # for ID and the element name will get you there most of the time.
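To make the difference between the three selector shapes concrete, here is a small sketch that models them on a plain JavaScript tree instead of a real page.  The tree, the node names and the helper functions are all made up for illustration; this is not how the browser implements selectors, just the same matching rules written out by hand.

```javascript
// A hypothetical mini-tree standing in for part of a page (not Reddit's real markup).
const page = {
  name: "post", classes: ["thing", "link"],
  children: [
    { name: "direct", classes: ["link"], children: [] },
    { name: "entry", classes: ["entry"], children: [
      { name: "nested", classes: ["link"], children: [] },
    ] },
  ],
};

// True when the node carries every class in the list (like ".thing.link").
const hasAll = (node, classes) => classes.every((c) => node.classes.includes(c));

// Every node below `node`, at any depth (what a space in a selector allows).
const descendants = (node) =>
  node.children.flatMap((child) => [child, ...descendants(child)]);

// ".thing.link": a single element that has both classes.
const both = [page, ...descendants(page)].filter((n) => hasAll(n, ["thing", "link"]));

// ".thing .link": a .link anywhere below the .thing (the root is the only .thing here).
const anywhere = descendants(page).filter((n) => hasAll(n, ["link"]));

// ".thing > .link": a .link that is a direct child of the .thing.
const direct = page.children.filter((n) => hasAll(n, ["link"]));

console.log(both.map((n) => n.name));     // [ 'post' ]
console.log(anywhere.map((n) => n.name)); // [ 'direct', 'nested' ]
console.log(direct.map((n) => n.name));   // [ 'direct' ]
```

The nested link only shows up for the space version, which is why > is the stricter choice when a page nests similar elements inside each other.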

05:19     One of the big deals with Chrome.ahk is that it’s hard to go from a list of elements to a list of data in your AutoHotkey code.  You want to do all that processing on the JavaScript side.

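06:00     The appropriate method is .map.  As with Array.prototype.filter earlier, map isn’t available on the list that querySelectorAll returns, so we’ll want to borrow it from an array with .call.

06:28     [] on the far left is an empty array, which gives you access to any valid array method.

06:53     Array.prototype is where JavaScript keeps those methods by default, so [].map is the same function as Array.prototype.map.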

07:46     This .map function does a marvelous thing!  It takes whatever your list is (in our case div tags), gets something out of each item and returns a new array with just that sub-value!

[].map.call(document.querySelectorAll(".thing.link"), (e) => 1)

[].map.call(document.querySelectorAll(".Post"), (e) => e.innerText)
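If you want to see the borrowing trick without a page open, the sketch below uses a plain object as a stand-in for the list that querySelectorAll returns.  The fakeNodeList object and its contents are assumptions made up for this example; the point is only that something with numbered entries and a length has no map of its own, yet [].map.call still works on it.

```javascript
// A stand-in for the list querySelectorAll returns: indexed entries
// plus a length, but none of the array methods.
const fakeNodeList = {
  0: { innerText: "First post" },
  1: { innerText: "Second post" },
  length: 2,
};

console.log(typeof fakeNodeList.map);        // undefined
console.log([].map === Array.prototype.map); // true

// Borrow map from an empty array and run it against the list.
const texts = [].map.call(fakeNodeList, (e) => e.innerText);
console.log(texts); // [ 'First post', 'Second post' ]

// Array.from is another way to get the same result (not the one used in the video).
const sameTexts = Array.from(fakeNodeList, (e) => e.innerText);
console.log(sameTexts); // [ 'First post', 'Second post' ]
```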

 

08:43     But this still isn’t quite what we want

08:55     We need to drill down a little further.  Looking at the first post, we could get the author, URL, etc.  But the post title isn’t part of the top element.  The post title is under a p tag with a class of “title”, and the link is an a tag with a class of “title”.

09:59     Like we did earlier with chaining, we can use a querySelector on each element to get results from just inside it

[].map.call(document.querySelectorAll(".thing.link"), (e) => e.querySelector("a"))

But if you want to be extra safe, you can ask for the “a” element and the class “title” together

[].map.call(document.querySelectorAll(".thing.link"), (e) => e.querySelector("a.title"))

And then you can grab the text from there

[].map.call(document.querySelectorAll(".thing.link"), (e) => e.querySelector("a.title").innerText)

Let’s reformat it to make it clear

[].map.call(  // We want to map over
    document.querySelectorAll(".thing.link"),  // all of our posts
    (e) => e.querySelector("a.title").innerText  // For each post, query for the title element and return its innerText
)
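One thing to watch for: querySelector returns null when nothing matches, so reading .innerText on a post without an a.title would throw.  Here is a hedged sketch of a more defensive version.  The extractTitles name and the fakePost stand-ins are invented for this example so it can run outside a browser.

```javascript
// Collect title text from a list of post elements, skipping posts
// where no a.title element is found.
function extractTitles(posts) {
  return [].map
    .call(posts, (e) => e.querySelector("a.title"))
    .filter((link) => link !== null)
    .map((link) => link.innerText);
}

// Hypothetical stand-ins for DOM elements: querySelector answers only
// for "a.title", and returns null when the post has no title.
const fakePost = (title) => ({
  querySelector: (selector) =>
    selector === "a.title" && title !== null ? { innerText: title } : null,
});

const sampleTitles = extractTitles([
  fakePost("Hotkey help"),
  fakePost(null),
  fakePost("Script showcase"),
]);
console.log(sampleTitles); // [ 'Hotkey help', 'Script showcase' ]
```

In the browser console you would call it as extractTitles(document.querySelectorAll(".thing.link")).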

11:50 Now we have an array of titles.  That’s exactly what we need to get our value into AutoHotkey.

12:31 Now if we run this in AutoHotkey, we still get a blank MsgBox because we’re returning an object.

13:00 Run the result through the Jxon_Dump function

js=[].map.call(document.querySelectorAll(".thing.link"), (e) => e.querySelector("a.title").innerText)

MsgBox % Chrome.Jxon_Dump(page.Evaluate(js))

and we’ll see the subtype is an array.

13:27 To make it easier to read, the 2nd parameter of the Jxon_Dump() function is for formatting.  If you pass a 4, it will make the output “pretty”

Msgbox % Chrome.Jxon_Dump(page.Evaluate(js),4)

13:36 Each key then starts on a new line with four spaces before it.

13:56 What we see is a description saying that the result is an array, not the values themselves.  That’s just a quirk of the Chrome DevTools Protocol.  More often it will give you a very short version of what you’re looking for.

14:09 Take advantage of the Jxon_Dump function Coco wrote as you want to serialize it.   When you’re passing large amounts of data from JavaScript to AutoHotkey, you’ll want to serialize it.

14:45 To add one more layer of complexity, wrap all of that in JSON.stringify

JSON.stringify([].map.call(document.querySelectorAll(".thing.link"), (e) => e.querySelector("a.title").innerText))
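Here is a small sketch of what that wrapping buys you, using a made-up array of titles instead of a live page.  JSON.stringify turns the whole array into one string, which is the kind of value that comes across to AutoHotkey cleanly, and JSON.parse shows nothing was lost.  The third argument to JSON.stringify is JavaScript’s own indent option; it is shown only for comparison with the formatting parameter of Jxon_Dump and is not the same function.

```javascript
// Sample data standing in for the titles pulled from a page.
const sampleTitles = ["First post", "Second post"];

// One string instead of an array object.
const serialized = JSON.stringify(sampleTitles);
console.log(typeof serialized); // string
console.log(serialized);        // ["First post","Second post"]

// Parsing the string gives back an equivalent array.
const restored = JSON.parse(serialized);
console.log(restored.length);   // 2

// An indent of 4 puts each item on its own line with four spaces before it.
const pretty = JSON.stringify(sampleTitles, null, 4);
console.log(pretty);
```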

15:42     AutoHotkey objects and JSON are almost exactly the same format.

16:00     Let’s put the JavaScript into a continuation section in AutoHotkey.

16:23     Let’s first look at it and make sure it is correct.  A line that starts with a right paren would end the continuation section, so we need to move the right parens up to let AutoHotkey process it easily (we could have escaped them instead).

17:25     Now let’s try to run that through AutoHotkey.

18:13     Let’s look at JSON (https://www.json.org/json-en.html).  There we can see what JSON can and cannot be.

JSON is built on two structures: a collection of key-value pairs called an object, and an ordered list of values called an array.  Just like in AutoHotkey, an object starts with { and an array starts with [

19:31     If you work a lot with JavaScript, you’ll see null come up a lot.  But it doesn’t come up that often in Web Scraping.
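A quick way to get a feel for what JSON can and cannot be is to push a few values through JSON.stringify in the console.  The sample values and key names below are made up for illustration.

```javascript
// The two structures: an object starts with { and an array starts with [.
const asObject = JSON.stringify({ title: "First post", score: 12 });
const asArray = JSON.stringify(["First post", "Second post"]);
console.log(asObject); // {"title":"First post","score":12}
console.log(asArray);  // ["First post","Second post"]

// null is valid JSON, but undefined and functions are not:
// inside an object those keys are dropped entirely...
const lossy = JSON.stringify({ kept: null, dropped: undefined, alsoDropped: () => 1 });
console.log(lossy); // {"kept":null}

// ...and inside an array they are replaced with null to keep the positions.
const padded = JSON.stringify([undefined, () => 1]);
console.log(padded); // [null,null]
```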

20:06     Most of the time you’re dealing with JSON it is because you’ve pulled data from somewhere and they want to tag it with a name (for the key).
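As a sketch of what tagging values with a name looks like, the map callback can return a small object per post instead of a bare string.  The describePosts function, the title and url keys, the fakePost stand-in and the example.com address are all assumptions for this example.

```javascript
// Turn each post into an object whose keys name the values.
function describePosts(posts) {
  return [].map.call(posts, (e) => {
    const link = e.querySelector("a.title");
    return { title: link.innerText, url: link.href };
  });
}

// Hypothetical stand-in for a post element so the sketch runs outside a browser.
const fakePost = (title, href) => ({
  querySelector: () => ({ innerText: title, href: href }),
});

const described = JSON.stringify(
  describePosts([fakePost("Hotkey help", "https://example.com/1")])
);
console.log(described); // [{"title":"Hotkey help","url":"https://example.com/1"}]
```

The result is an array of objects, so both JSON structures show up in the same string.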

20:38 We’ve got our data into AutoHotkey, but it is one big glob of text.  That’s where Chrome.Jxon_Load() comes in handy.  When you feed it that JSON array, it will create an AutoHotkey object.

Msgbox % Chrome.Jxon_Load(page.Evaluate(js).value)

This gives a blank box because it is an object.

For index, value in Chrome.Jxon_Load(page.Evaluate(js).value) {
    MsgBox % value
}

21:49 Now we can step through it and see if we’re getting what we wanted.

22:16 We could save this to a variable to process later, or access it multiple times.

Titles:= Chrome.Jxon_Load(page.Evaluate(js).value)

Msgbox % titles[1]

23:26 So we (AutoHotkey users) have to shift our mentality when working with Chrome.ahk to do our heavy-lifting in JavaScript.

23:35 One thing many sites offer on the JavaScript side that makes this much easier is a library called jQuery.  On those sites, you can have it do most of the legwork.

24:06 As long as your JavaScript code outputs a string or a number at the end, it’s really easy to get that into AutoHotkey.  You first have to get your JavaScript to get there!

24:32 Type jQuery in the console and see what is provided.

24:38 Something like Google.com may not provide it, as it is very slimmed-down.

25:24 jQuery is available in a lot of places, but often it isn’t.
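Since you can’t count on jQuery being there, it can help to check before relying on it.  Below is a minimal sketch of such a check.  The hasJQuery name is made up, and the objects passed in are stand-ins for the page’s global object so the example runs anywhere.

```javascript
// Report whether a global object exposes jQuery as a function.
function hasJQuery(globalObject) {
  return typeof globalObject.jQuery === "function";
}

console.log(hasJQuery({}));                         // false
console.log(hasJQuery({ jQuery: function () {} })); // true
```

In a browser console you would call hasJQuery(window), and because the answer is a plain true or false it is easy to bring back into AutoHotkey.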

25:45 When you Google how to solve things in JavaScript, a lot of responses will be very simple examples in jQuery.

26:23 jQuery is mostly for manipulating the elements on a web page.