Once I needed to download a multi-page forum thread (more than 1000 pages) into a single-page document. So I wrote a .NET Core utility, which is presented in this article.

[Image: thread downloader logo]

Of course, it is not a universal forum thread downloader, but it can become one, because all you need to do to adapt it to a different forum is change the regular expressions used for parsing.

Okay, here’s the thread (printable version, for simplicity):

[Image: forum thread schema]

The plan is:

  1. Get the header, delete navigation links from it and inject custom styles;
  2. Recursively download the content from all pages;
  3. Get the footer and delete navigation links from it;
  4. Save everything to a file.

The algorithm flowchart looks like this:

[Image: forum thread downloader algorithm]

Sure, it’s not the most beautiful or efficient algorithm possible. For example, you can already see that it’s a bad idea to leave the footer processing, which might fail, to the very end of the program: it should come right after the header processing.
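
One way to do that, as a minimal sketch: take both the header and the footer from the first page before any other page is requested, so a regex that does not match fails right away. The names `PageFrame` and `SplitFirstPage` are hypothetical, and the sketch assumes the footer markup is the same on every page, which may not hold for every forum.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageFrame
{
    // Hypothetical helper: extracts header and footer from the first page up front.
    // Throws immediately if either is missing, before further pages are downloaded.
    public static (string Header, string Footer) SplitFirstPage(
        string firstPage, string reHeader, string reFooter)
    {
        Match header = Regex.Match(firstPage, reHeader, RegexOptions.Singleline);
        if (!header.Success)
            throw new FormatException("Couldn't find header");

        Match footer = Regex.Match(firstPage, reFooter, RegexOptions.Singleline);
        if (!footer.Success)
            throw new FormatException("Couldn't find footer");

        // keep the footer in a variable and append it once all content is written
        return (header.Value, footer.Value);
    }
}
```

A call such as `var (header, footer) = PageFrame.SplitFirstPage(page, reHeader, reFooter);` would then replace the footer lookup at the end of the run.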

But anyway, it’s a pretty easy one to implement. The only difficult part is composing correct regular expressions. If you don’t know it yet, there is an outstanding online tool for that: https://regex101.com
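
Two things are easy to trip over when such expressions move from an online tester to .NET: by default `.` does not match a line break (that needs `RegexOptions.Singleline`), and a greedy `.*` runs to the last possible match rather than the first. Here is a small sketch on a made-up HTML fragment; `RegexNotes`, `Sample` and both patterns are illustrative only and are not the expressions used for the real forum.

```csharp
using System.Text.RegularExpressions;

public static class RegexNotes
{
    // Made-up fragment with a line break inside the block we want to capture.
    public const string Sample =
        "<div id=\"a\">first\nline</div><div id=\"b\">second</div>";

    public const string Greedy = @"<div id=""a"">(.*)</div>";
    public const string Lazy = @"<div id=""a"">(.*?)</div>";

    // Returns the first capture group, or "(no match)" if the pattern fails.
    public static string Capture(string pattern, RegexOptions options)
    {
        Match match = Regex.Match(Sample, pattern, options);
        return match.Success ? match.Groups[1].Value : "(no match)";
    }
}
```

With `RegexOptions.None` the greedy pattern does not match at all, because `.` stops at the line break. With `RegexOptions.Singleline` it matches, but captures everything up to the last `</div>`. The lazy pattern with `RegexOptions.Singleline` captures only the two lines of the first block.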

And here’s the implementation:

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Xml.Linq;

// TODO: Don't accumulate the whole file in memory, implement some buffer and append file on disk chunk by chunk
// TODO: Remote server can ban your IP for so many requests, implement some IP-address switching mechanism
// TODO: Looking for footer should be happening right after processing header (store it in some variable and append after getting content is finished)
// TODO: Find better and more reliable regular expressions
// TODO: Add exceptions, especially for downloading part

namespace forum_thread_downloader
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine($"[{DateTime.Now.ToString()}] Downloading has started\n- - -");

            // create an instance of the class and initialize settings (first page link and regexes)
            TopicDownloader td = new TopicDownloader(
                "THREAD-FIRST-PAGE-LINK", // placeholder: link to the first page (printable version) of your thread
                @"<a rel=""next"" .*(https:\/\/.*.html).*<\/a>",
                @"<!DOCTYPE html>.*<\/div>.{0,4}<br \/>",
                @"<table.*<\/table>.{0,4}<br \/>.{0,4}<div>",
                @"(<table class=""tborder"".*).{0,2}<br \/>.{0,4}<br \/>.{0,4}<table cellpadding=""0"".*",
                @"<br \/>.{0,4}<table cellpadding=""0"".*<\/html>",
                @"<br \/>.{0,4}<table.*<\/table>"
                );

            StringBuilder topicPageBuilder = new StringBuilder();
            if (td.DownloadTopic(topicPageBuilder))
            {
                // save everything to file
                File.WriteAllText("topic.html", topicPageBuilder.ToString());
            }

            Console.WriteLine($"- - -\n[{DateTime.Now.ToString()}] Downloading has finished");
        }

        class TopicDownloader
        {
            public TopicDownloader(
                string link,
                string reNP,
                string reH, string reHL,
                string reC,
                string reF, string reFL
                )
            {
                _threadLink = link;
                _reNextPage = reNP;
                _reHeader = reH;
                _rePlaceHeader = new Regex(reHL, RegexOptions.Singleline);
                _reContent = reC;
                _reFooter = reF;
                _rePlaceFooter = new Regex(reFL, RegexOptions.Singleline);
                _pageNumber = 0;
            }

            /// <summary>
            /// Link to the first page of the topic
            /// </summary>
            private string _threadLink;

            /// <summary>
            /// Regex for next page link
            /// </summary>
            private readonly string _reNextPage;

            /// <summary>
            /// Regex for header
            /// </summary>
            private readonly string _reHeader;
            /// <summary>
            /// Regex for deleting navigation links from header
            /// </summary>
            private readonly Regex _rePlaceHeader;
            /// <summary>
            /// Regex for page content
            /// </summary>
            private readonly string _reContent;
            /// <summary>
            /// Regex for footer
            /// </summary>
            private readonly string _reFooter;
            /// <summary>
            /// Regex for deleting navigation links from footer
            /// </summary>
            private readonly Regex _rePlaceFooter;
            /// <summary>
            /// Current page number
            /// </summary>
            private int _pageNumber;

            public bool DownloadTopic(StringBuilder topicPageBuilder)
            {
                // try to download the page
                var rez = Task.Run(async () =>
                {
                    var response = await DownloadPage(_threadLink);
                    return response;
                });
                // error
                if (rez.Result.Item1 != 200)
                {
                    Console.WriteLine($"Some error. Status code: {rez.Result.Item1}");
                    return false;
                }

                string webpage = rez.Result.Item2;
                // find header (only on the first page)
                if (_pageNumber == 0)
                {
                    var matchHeader = Regex.Match(
                        webpage,
                        _reHeader,
                        RegexOptions.Singleline
                        );
                    if (matchHeader.Success)
                    {
                        // delete base and add style
                        Regex rePlace = new Regex("<base.*-->");
                        string headerWObase = rePlace.Replace(
                            matchHeader.Groups[0].Value,
                            "<link rel=\"stylesheet\" href=\"threadStyle.css\" />"
                            );

                        // delete navigation links and save
                        topicPageBuilder.Append(
                            _rePlaceHeader.Replace(headerWObase, "<div>")
                            );
                    }
                    else
                    {
                        Console.WriteLine("[error] Couldn't find header");
                        return false;
                    }
                }

                topicPageBuilder.Append($@"<div align=""center""><h1>Page {_pageNumber + 1}</h1></div>");

                // find content
                var matchContent = Regex.Match(
                    webpage,
                    _reContent,
                    RegexOptions.Singleline
                    );
                if (matchContent.Success)
                {
                    // the first group holds the posts without the trailing navigation
                    topicPageBuilder.Append(matchContent.Groups[1].Value);
                }
                else
                {
                    Console.WriteLine("[error] Couldn't find content");
                    return false;
                }
                Console.WriteLine($"Page {_pageNumber + 1} has been processed");
                _pageNumber++;

                _threadLink = GetNextPage(webpage);
                // debug
                //if (_pageNumber > 3) { _threadLink = string.Empty; }
                if (!string.IsNullOrEmpty(_threadLink))
                {
                    return DownloadTopic(topicPageBuilder);
                }
                else
                {
                    // find footer
                    var matchFooter = Regex.Match(
                        webpage,
                        _reFooter,
                        RegexOptions.Singleline
                        );
                    if (matchFooter.Success)
                    {
                        // delete navigation links and save
                        topicPageBuilder.Append(
                            _rePlaceFooter.Replace(matchFooter.Groups[0].Value, "")
                            );
                    }
                    else
                    {
                        Console.WriteLine("[error] Couldn't find footer");
                        return false;
                    }

                    Console.WriteLine("End of topic");
                    return true;
                }
            }

            /// <summary>
            /// Parses the string looking for a link to the next page
            /// </summary>
            /// <param name="a">string to parse</param>
            /// <returns>link to the next page</returns>
            private string GetNextPage(string a)
            {
                var match = Regex.Match(a, _reNextPage);
                if (match.Success)
                {
                    return match.Groups[1].Value;
                }
                else
                {
                    return string.Empty;
                }
            }

            /// <summary>
            /// Download page with given URL
            /// </summary>
            /// <param name="URL">link to the page</param>
            /// <returns>HTTP status code and page as string</returns>
            public static async Task<Tuple<int, string>> DownloadPage(string URL)
            {
                using (var httpClient = new HttpClient())
                {
                    var httpResponse = await httpClient.GetAsync(URL);
                    var httpContent = await httpResponse.Content.ReadAsStringAsync();

                    return new Tuple<int, string>(
                        (int)httpResponse.StatusCode,
                        httpContent
                        );
                }
            }
        }
    }
}
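
Some of the TODO items from the top of the listing can be sketched fairly compactly: a plain loop instead of recursion, writing each page to the output as soon as it is parsed instead of keeping the whole thread in memory, and a pause between requests. This is only a rough sketch, not the utility's actual code. `StreamingDownloader` and its delegate parameters (`fetchPage`, `extractContent`, `getNextLink`) are hypothetical stand-ins for the downloading and regex parsing shown above, and the delay value is up to you. It does nothing about IP switching.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class StreamingDownloader
{
    // Walks the thread page by page and returns the number of processed pages.
    // An exception thrown by fetchPage (e.g. on a failed request) stops the loop.
    public static async Task<int> DownloadAsync(
        string firstLink,
        TextWriter output,
        Func<string, Task<string>> fetchPage,   // downloads one page
        Func<string, string> extractContent,    // cuts the posts out of a page
        Func<string, string> getNextLink,       // returns "" on the last page
        TimeSpan delay)
    {
        int pages = 0;
        string link = firstLink;

        while (!string.IsNullOrEmpty(link))
        {
            string page = await fetchPage(link);

            // write this chunk right away instead of accumulating it
            await output.WriteAsync(extractContent(page));
            pages++;

            link = getNextLink(page);
            if (!string.IsNullOrEmpty(link))
            {
                // be gentle with the remote server
                await Task.Delay(delay);
            }
        }

        await output.FlushAsync();
        return pages;
    }
}
```

For a real run, `output` could be a `StreamWriter` opened on `topic.html`, and `fetchPage` could wrap `GetStringAsync` of a single shared `HttpClient`, which throws on a non-success status code.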

Here’s the threadStyle.css I injected:

body {
    width: 70%;
    margin: auto;
}

table.tborder {
    margin-top: 15px;
    border: 2px solid;
}

tbody > tr > td {
    color: darkgreen;
    font-weight: bold;
}

h1 {
    margin-top: 50px;
    margin-bottom: 50px;
}

And that’s the result (full page is around 14 MB and several kilometres long, so here’s just a fragment):

[Image: fragment of the resulting single-page forum thread]

Complete project source code can be found here.