Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats

mikeytown2 - August 3, 2009 - 07:01
Project: Boost
Version: 6.x-1.x-dev
Component: Caching logic
Category: feature request
Priority: normal
Assigned: Unassigned
Status: closed
Description

This thread aims to combine these two issues:
#363077: Add spider to crawler - Cache entire site with new install.
#337391: Setting to grab url's from url_alias table.

Ideally the export module would be the way to go; the only problem is that the Batch API doesn't work from cron (see #229905: Batch API assumes client's support of meta refresh). So I need to combine the above two issues; that is the goal of this thread. The first step is to grab some URLs from the database on cron and cache them.

#1

mikeytown2 - August 3, 2009 - 07:03
Status: active » needs review

This does 5 URLs at a time.

Attachment: boost-538460.patch (3.56 KB)

#2

mikeytown2 - August 3, 2009 - 10:21

This does 5 at a time and calls itself for unlimited URL crawlage.

Attachment: boost-538460.1.patch (6.02 KB)
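
The batch-then-recall pattern from #1 and #2 can be sketched roughly as follows. This is a minimal Python illustration, not the code in the patch: `run_batch`, `crawl_all`, and the `fetch` callback are hypothetical names, and a plain loop stands in for whatever mechanism the module uses to call itself again.

```python
from collections import deque
from typing import Callable, Deque, List

BATCH_SIZE = 5  # the patch in #1 handles 5 URLs per run


def run_batch(queue: Deque[str], fetch: Callable[[str], None],
              batch_size: int = BATCH_SIZE) -> bool:
    """Fetch up to batch_size URLs; return True if more work remains."""
    for _ in range(min(batch_size, len(queue))):
        fetch(queue.popleft())
    return bool(queue)


def crawl_all(urls: List[str], fetch: Callable[[str], None]) -> int:
    """Keep starting new runs until the queue is empty; return the run count."""
    queue = deque(urls)
    runs = 0
    while queue:
        runs += 1
        # Assumption: each iteration stands in for one self-triggered run.
        run_batch(queue, fetch)
    return runs
```

Keeping each run short is what lets the total crawl be unbounded: no single run has to outlive a request timeout.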

#3

mikeytown2 - August 4, 2009 - 02:05

Fine-grained control over HTML, XML & JSON.

Attachment: boost-538460.2.patch (10.33 KB)

#4

mikeytown2 - August 4, 2009 - 10:08
Status: needs review » needs work

In order for this to be smart, I need a boost_crawler table. Once the crawler is done it will truncate the table, signaling to the worker thread(s) that it is done. This way I can grab URLs from boost_cache, url_alias, & nodes, and I should be able to make it use multiple threads. I can then also crawl only the pages that are not currently cached.

id - serial - PK
extension - varchar(8)
url - varchar(255)

Index on id & extension.
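
As a rough illustration of the table described above, here is a minimal sketch using Python's built-in sqlite3 module rather than Drupal's schema API. The helper names, the index name, and the reading of "index on ID & extension" as one composite index are assumptions, not details from the patch.

```python
import sqlite3

# Assumption: a single composite index; the post could also mean two indexes.
SCHEMA = """
CREATE TABLE boost_crawler (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- "serial" primary key
    extension VARCHAR(8),                         -- html, xml or json variant
    url       VARCHAR(255)
);
CREATE INDEX boost_crawler_id_extension ON boost_crawler (id, extension);
"""


def create_queue(conn: sqlite3.Connection) -> None:
    """Create the crawler queue table and its index."""
    conn.executescript(SCHEMA)


def queue_is_empty(conn: sqlite3.Connection) -> bool:
    """Workers can treat an empty table as the 'crawl finished' signal."""
    (count,) = conn.execute("SELECT COUNT(*) FROM boost_crawler").fetchone()
    return count == 0


def finish_crawl(conn: sqlite3.Connection) -> None:
    """Empty the table; SQLite has no TRUNCATE, so DELETE stands in for it."""
    conn.execute("DELETE FROM boost_crawler")
    conn.commit()
```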

#5

capellic - August 4, 2009 - 16:09

Exciting to see progress on this!!!!

BTW, just wondering if my issue over here at the Google Analytics module is cause for concern when pre-caching. Certainly this issue wouldn't be limited to GA; it would apply to any sort of stat-tracking module.

#538626: Exclude hits when a specific arg string is present

#6

mikeytown2 - August 4, 2009 - 19:31

@capellic
The crawler loads only the HTML; it doesn't load embedded elements such as JavaScript or images, so you don't have to worry about third-party stats. It will count in Drupal's own statistics though, since those are counted when the HTML is loaded.

#7

mikeytown2 - August 5, 2009 - 08:51

Just tested the above patch on a live server and it works as advertised, which means the core of the crawler code works in Drupal core. That's good; it's like the Batch API but it doesn't need a browser to call itself. So my two-step PHP hack works; now I need to make the crawler smarter.

Roadmap:

  • Copy the URLs from the boost_cache table into the boost_crawler table. Pass a URL variable to keep track of progress on large sites; transfer 10,000 URLs at a time to the boost_crawler table via LIMIT. The SQL needs ORDER BY filename so that each batch picks up where the last one ended.
  • Allow a stop signal to be given.
  • - Ship RC1 -

  • Support multiple threads. The code will be built with this in mind, but it won't be enabled until it has been tested.
  • Use the url_alias table to populate boost_cache; this assumes everything in there is an HTML document unless it ends in .xml or /feed.
  • Enable showing stats on the crawler.
  • Allow the admin to fine-tune some of the crawler's parameters.
  • Have a real crawler, like the one in #363077: Add spider to crawler - Cache entire site with new install.; it auto-discovers content that is not in the boost_cache or url_alias tables but is linked from an <a href=""></a> HTML tag.
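
The first roadmap item (batched copying with LIMIT and a stable ORDER BY) could look something like the sketch below. It uses Python's sqlite3 for illustration only; `copy_batch`, `copy_all`, and the simplified column set (filename, extension, url) are assumptions, and the loop stands in for passing the offset along as a URL variable between runs.

```python
import sqlite3


def copy_batch(conn: sqlite3.Connection, offset: int,
               batch_size: int = 10_000) -> int:
    """Copy one batch from boost_cache to boost_crawler; return rows copied."""
    rows = conn.execute(
        "SELECT extension, url FROM boost_cache "
        "ORDER BY filename LIMIT ? OFFSET ?",  # stable order keeps batches disjoint
        (batch_size, offset),
    ).fetchall()
    conn.executemany(
        "INSERT INTO boost_crawler (extension, url) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


def copy_all(conn: sqlite3.Connection, batch_size: int = 10_000) -> int:
    """Copy batch after batch until none remain; return the total copied."""
    offset = 0
    while True:
        copied = copy_batch(conn, offset, batch_size)
        if copied == 0:
            return offset
        offset += copied
```

Without the ORDER BY, the database is free to return rows in a different order on each query, so consecutive batches could skip or repeat URLs.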

#8

mikeytown2 - August 5, 2009 - 11:06

Wow, that's some slow code... #3 took over twice as long to crawl a site as the crawler does. Some of it has to do with the sleep(1), but that's pretty awful. Doing a bootstrap for every 10 pages crawled is another source of slowness. I knew this would be a little bit slower, but I wasn't expecting this. For this to work well, there's a lot of work to do. I'll try refactoring the code (I made sure I wouldn't get a timeout), but if I can't get its performance to improve, I'll be shipping RC1 without the cron crawler.

#9

mikeytown2 - August 6, 2009 - 00:49
Status: needs work » needs review

Still need to add a stop-crawler button & code. But this should be faster, and it defaults to using 2 threads. Loads 25 URLs per run.

Attachment: boost-538460.3.patch (15.45 KB)

#10

mikeytown2 - August 6, 2009 - 08:21
Status: needs review » active

Committed the above code with a stop button.

#11

mikeytown2 - August 6, 2009 - 08:31
Status: active » fixed

#12

capellic - August 6, 2009 - 12:28

You're on fire! Just a question about item #1 in the roadmap.

Copy the URL's from the boost_cache table into the boost_crawler table.

Will this require me to use HTML file caching? Right now I am using my Pre-Cache module only to trigger Drupal core's database cache and do not have Boost HTML file caching turned on. Why not? Because the pages that aren't pre-cached aren't as popular so 1) they don't need to be cached and 2) I don't want the user to have the additional penalty of having to wait for the HTML file to be written to disk.

I realize that part of this is moot because the crawler removes concern #2, but I am still a bit concerned about having the crawler hit all nodes due to aforementioned performance issues - actually -- it's more of a preference to see the two be able to work independently from one another for flexibility's sake.

#13

mikeytown2 - August 6, 2009 - 13:23
Status: fixed » needs work

Right now the crawler is tied to Boost quite heavily, so it will only hit URLs that it can grab from the boost_cache table. If you enable Boost, hit the URLs you want, then disable the HTML file cache, the crawler will still grab those URLs in the future. Right now Boost could be split off into about 3 different modules, but it's much easier to develop for when it's all in one codebase. Your request gives me another table to get URLs from, which is cache_page:
http://api.drupal.org/api/function/page_set_cache
It's a temp table, so this wouldn't be that useful.

Here's the first bug: the crawler finished after doing only half the URLs. Here's an attempted fix. It's amazing how this doesn't work the same on all systems... gotta find code that does.

Attachment: boost-538460.5.patch (1.54 KB)

#14

mikeytown2 - August 6, 2009 - 14:53
Status: needs work » needs review

Figured out what's up. Each thread needs to wait a random amount of time, otherwise things get messy.

Attachment: boost-538460.6.patch (3.27 KB)
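
To illustrate the idea of staggering workers with a random wait, here is a small Python sketch. It is not the patch: the real "threads" appear to be separate requests sharing a database table, whereas this uses in-process threads, a list as the queue, and hypothetical names (`worker`, `crawl`, `max_jitter`). In this sketch the lock is what prevents two workers from claiming the same URLs; the jitter only spreads out when they reach the queue.

```python
import random
import threading
import time
from typing import List


def worker(queue: List[str], lock: threading.Lock, done: List[str],
           max_jitter: float = 0.01, batch_size: int = 25) -> None:
    """Claim batches of URLs until the queue is empty."""
    while True:
        # Random wait so the workers do not all hit the queue at once.
        time.sleep(random.uniform(0, max_jitter))
        with lock:
            batch = queue[:batch_size]
            del queue[:batch_size]
        if not batch:
            return
        with lock:
            done.extend(batch)  # stand-in for actually fetching each URL


def crawl(urls: List[str], threads: int = 2) -> List[str]:
    """Run the workers to completion and return every URL they handled."""
    queue = list(urls)
    lock = threading.Lock()
    done: List[str] = []
    workers = [
        threading.Thread(target=worker, args=(queue, lock, done))
        for _ in range(threads)
    ]
    for thread in workers:
        thread.start()
    for thread in workers:
        thread.join()
    return done
```

The defaults mirror the figures mentioned in #9: 2 threads and 25 URLs per run.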

#15

mikeytown2 - August 6, 2009 - 14:57
Status: needs review » fixed

Committed this.

#16

giorgio79 - August 13, 2009 - 12:20

Hey Mikey,

Nice stuff.

Would you know the crawl rate? Something like how many pages per sec are crawled?
Could I request a feature to set the time between crawl requests, like 4-5 secs between each?

Cheers,
G

#17

mikeytown2 - August 13, 2009 - 17:22
Status: fixed » active

Being able to get the crawl rate would be semi-possible. It would be the average since the crawler started, counted across all threads, and it would be quite inaccurate. Increased accuracy means a much slower crawler, since each thread would have to write to the database after each page was crawled. Having a "throttle" would also be possible using the usleep() function inside the crawler; but why wait 5 seconds between each request? Being able to set the number of threads is also needed.
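
A throttle plus an average crawl rate might be sketched like this in Python. It is single-threaded and illustrative only: `crawl_throttled` and its parameters are hypothetical, `time.sleep` (seconds) stands in for PHP's usleep() (microseconds), and the injectable `clock` and `sleep` arguments exist purely to make the function easy to test.

```python
import time
from typing import Callable, Iterable


def crawl_throttled(urls: Iterable[str],
                    fetch: Callable[[str], None],
                    delay_seconds: float = 0.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> float:
    """Fetch each URL, pausing after every request.

    Returns the average crawl rate in pages per second since the start.
    """
    start = clock()
    count = 0
    for url in urls:
        fetch(url)
        count += 1
        sleep(delay_seconds)  # the throttle
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else 0.0
```

Because the rate is only computed once at the end, it costs nothing per page; reporting it live would require each worker to record progress after every fetch, which is the trade-off described above.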

#18

giorgio79 - August 13, 2009 - 17:46

The 5 seconds was just an arbitrary figure; it would be nice if the number of seconds to wait between requests could be set.

Just like the number of threads running.

I am happy with 1 thread with 2-5 secs between requests. :)

I have 10 000 nodes, and I am afraid of hammering the server if I turn the feature on as it is at the moment :)

Also not sure about the php timeout for this many nodes.
Is this done when cron runs? Or when the submit button is pressed to clear cache?

#19

mikeytown2 - August 13, 2009 - 18:08

PHP/CPU timeouts are taken care of; they shouldn't happen, ever. The crawler starts right after cron is run.

#20

g10tto - August 13, 2009 - 20:14

I have a site requiring user authentication for 95% of its content. Since Boost only caches pages viewed by anonymous users, is it possible to still use this crawler (using default Drupal and Views caching) with my site?

#21

mikeytown2 - August 13, 2009 - 20:37

@g10tto
The crawler hits pages as an anonymous user, so if a page returns a 403, crawling it doesn't really help anyone. Drupal's cache (what you see on the performance page) is for anonymous users as well. At the bottom of the Boost project page are links to other performance/caching methods.

#22

g10tto - August 13, 2009 - 21:44

@mikey Thanks for the tip.

Is there another known crawler program for Drupal, or a way to implement a 3rd party crawler? I'm just curious, because it seems like the crawler functionality would especially come in handy on larger news-related sites that have an archive of nodes, and I would be surprised if there were no other implementation until now.

In addition, I have wondered for some time now why Boost is restricted in this way, if such functionality regarding this thread is restricted to the module.

#23

mikeytown2 - August 13, 2009 - 22:02

#363077: Add spider to crawler - Cache entire site with new install.
Here's a crawler I made that's independent of Drupal. It has crawled a site with over 1,000,000 URLs; it works, but it's not easy to set up.

...why Boost is restricted in this way, if such functionality regarding this thread is restricted to the module.

I'm assuming you're talking about it only caching what anonymous users see. Anonymous is easy: there's only 1 version of each page, and even with that, Boost is still a very complicated module.

#24

mikeytown2 - August 16, 2009 - 07:18
Title: Auto Regenerate Cache (pre-caching) Preemptive Cron Cache » Auto Regenerate Cache (pre-caching) Preemptive Cron Cache - throttle & crawl rate stats

#25

mikeytown2 - August 21, 2009 - 02:37
Status: active » fixed

#26

capellic - August 29, 2009 - 22:04

Nice! Since I've built my own Pre-Caching module to take care of the 5 to 10 top-level pages on my site, I don't have a use for this at this time. I have a couple of bigger projects coming up and I'll certainly be giving this feature a spin!

#27

System Message - September 12, 2009 - 22:10
Status: fixed » closed

Automatically closed -- issue fixed for 2 weeks with no activity.

 
 
