Fabrice Canel with Jason Barnard - the Bing Series


Fabrice Canel talks to Jason Barnard about Bingbot.

Fabrice was on the podcast last year talking about JavaScript and the new indexing API. That was very interesting and he shared quite a few insights…

If you would rather read than watch or listen, here is an article I wrote based on this conversation >>

This Episode Takes the Conversation a BIG Step Further

This conversation is on a whole different planet. Fabrice is head of the entire discovery-crawling-extracting-indexing process. Think about how much that involves. And how important he and his team are to the process of getting your content to the top of the results.

You cannot hope to get your content into search results if it isn’t found, crawled, extracted and indexed… and since he manages every single one of those steps, he is a person we really need to listen to.

Bingbot and Googlebot Function in Much the Same Way

Obviously they don’t function exactly the same way down to the tiniest detail. But close enough …

  1. the process is exactly the same
    (discover, crawl, extract, index)
  2. the content they are indexing is exactly the same
  3. the problems they face are exactly the same
  4. the underlying technology they use is the same

So the details of exactly how they achieve each step will differ. But they face the same environment and aim to do the same thing – index the web effectively. So we can safely assume Google deals with the discovery-crawling-extracting-indexing process in a manner very, very close to the way Bing does.

Just think about whatever industry you are in – the details differ, but every competitor builds on the same foundation. Easy to forget, but search is just another industry. So the same applies here.

Google functions much the same way as Bing. And vice versa. Close enough for us not to need to worry too much about the differences.

Stunning Insights. I Learned sooooo Much.

The conversation with Frédéric Dubut that kicked off this series (an episode recorded at UnGagged) suddenly looks tame and unrevealing. A simple ‘mise en bouche’ (an appetiser), as we say in French.

Listen and Learn

  • Google and Bing collaborate on Chromium
  • Bing discovers more than 70 billion previously unseen URLs every day
  • Bingbot pre-filters to store only the ‘best’ content
  • New technology is coming out for rendering (machine learning + JavaScript)
  • Standardised HTML is powerful
  • Bing (and we can safely assume Google) is getting exponentially better at extracting information
  • The process of storing the content is MUCH more important than you probably imagine
  • Every team at Bing that builds candidate sets relies on Bingbot
  • Nofollow has always been just a hint
  • Sitemaps and RSS are incredibly important
  • Indexing includes annotation, and annotations are fundamentally important to all the other teams and their algos
  • Indexing includes classification, and classification is fundamentally important to all the other teams and their algos

In short, as SEOs, we all depend on Fabrice and his team to an extent most of us will probably only start to grasp after watching the episode. This is the foundation of ranking in search. Everything else depends on this.

Fabrice is a truly lovely guy who wants to help you as a website manager… if only you’d help him help you. Here he tells you what he (and, presumably, his equivalent at Google) wants from you so that he can help you get your content to rank.

Help them overcome their problems, and you WILL be rewarded. Groovy!

Catch the rest of the Bing Series:

  1. How Ranking Works at Bing – Frédéric Dubut, Senior Program Manager Lead, Bing
  2. Discovering, Crawling, Extracting and Indexing at Bing – (this episode) Fabrice Canel, Principal Program Manager, Bing
  3. How the Q&A / Featured Snippet Algorithm Works – Ali Alvi, Principal Lead Program Manager AI Products, Bing
  4. How the Image and Video Algorithm Works – Meenaz Merchant, Principal Program Manager Lead, AI and Research, Bing
  5. How the Whole Page Algorithm Works – Nathan Chalmers, Program Manager, Search Relevance Team, Bing

Full transcript of “Bingbot: Discovering, Crawling, Extracting and Indexing (Fabrice Canel with Jason Barnard)”


Jason Barnard: A quick hello, and we’re good to go. Welcome to the show, Fabrice Canel!

Welcome, lovely — you know, we’re in the Bing offices. Yes, again, I had you on the show last year, it was just audio, now we’ve got video so everyone can see what Fabrice looks like. Fabrice, incredibly important person at Bing who crawls, extracts, and stores.

Fabrice Canel: Yes, I do all of it. Every day I am in charge of discovering internet content — all the internet content. I am in charge of selecting the best content on the internet, as you said. I am fetching and crawling the best content from the internet, then processing it and understanding it.

Jason Barnard: So one question is: when you crawl, you’re actually looking for what’s best, so there’s a pre-filter even before the ranking engine?

Fabrice Canel: Every day we discover more than seventy billion URLs that we have never seen before.

Jason Barnard: Every day? Seventy billion?

Fabrice Canel: Seventy billion — it’s a lot of content. Obviously we will remove useless URLs. Just to give you a sense of the size of the internet: the size of the internet is really infinite, there is an infinite number of URLs out there. People create content, but then there are systems that are auto-generating content. You have pages with calendars where we can follow links — all kinds of useless links — but often you have to follow those links to discover whether they’re good or not. You have to fetch them.

Jason Barnard: So when you say you select…

Fabrice Canel: Yes, we select the best content for indexing, but often we have to fetch first to discover whether a link is good or not.

Jason Barnard: So my initial idea was that you do a pre-sorting, but in fact you’re just getting rid of the junk.

Fabrice Canel: We first get rid of the junk, then we still fetch to discover if it is useful or not, because we don’t know. Sometimes it’s just a link to a page we’ve never seen, and we take a decision based on what we find — whether this page is useful for satisfying user queries or not.

Jason Barnard: So with every page, you’re going to crawl it, extract information, figure out if it’s useful or not, and if it’s not useful, do you still store it?

Fabrice Canel: Obviously, if we continuously see that these pages are dead links, we will stop indexing them at some point.

Jason Barnard: But in processing a page, how do you tag it to not be crawled again — or do you just keep crawling it?

Fabrice Canel: Dead links are a very good challenge, because often you have pages that are dead links but come back. You may buy a domain and not populate it — we call that a parked domain, where there is no useful content yet. Then you publish some content, and maybe you forget to renew the domain, so it becomes a parked domain again, and somebody else buys it. There is a lifecycle of URLs. At the end, yes, we take decisions based on URLs — especially what we call tail URLs, which are very long URLs that are essentially useless — and we will stop visiting them, especially if nobody links to them anymore.

Jason Barnard: So you’ve got a lifecycle of URLs — already an interesting concept. You keep tracking them just in case they come back.

Fabrice Canel: Yes.

Jason Barnard: And another thing you just said: very long URLs are a signal that the page is rubbish?

Fabrice Canel: It can be a signal, especially if we continue to see dead links, the URL is very long, and nobody is linking to it anymore. We may decide, okay, this page is a dead link and nobody is visiting it — until we see somebody linking to it again, at which point we say, well, there is a new link to this page, so maybe we should visit it again.

Jason Barnard: Right, so you crawl the URLs, look at what’s in there, and decide if it’s junk or if it’s actually useful. What are the problems with extracting the information? I love HTML5, and John Mueller from Google said it’s probably not worth using because people use it so badly that they can’t rely on it and don’t really pay attention to it.

Fabrice Canel: I disagree a little bit. The web is built not only from pages created by hand in Notepad, but also from content management systems using templates that are well structured with very good information. It’s important to tag content properly to help search engines understand it — h1, h2, and h3 tags are useful for telling the story of headings, for marking the head of a section. Tables that are well structured also help search engines understand the concept of a table, the concept of a list.

Jason Barnard: Incredible. Sorry — I heard that 85% of tables are used for design, which creates an enormous problem for you, because a lot of tables are just there for layout and you’d expect data in them, but in fact they’re just…

Fabrice Canel: Yes, we do not recommend that. We prefer divs, spans, and CSS positioning, and reserve tables for data — for saying, okay, this is the list of planets in the solar system, this is a real table with real data. Using tables for design confuses the understanding of a page.
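
To make that concrete, here is a minimal sketch of the kind of markup Fabrice is describing (my illustration, not Bing's): headings that tell the story of the page, a table reserved for genuine data such as the planets he mentions, and layout left to divs and CSS.

    <article>
      <h1>The Solar System</h1>
      <h2>The planets</h2>
      <!-- A real data table: the rows carry data, not layout -->
      <table>
        <thead>
          <tr><th>Planet</th><th>Distance from the Sun (AU)</th></tr>
        </thead>
        <tbody>
          <tr><td>Mercury</td><td>0.39</td></tr>
          <tr><td>Venus</td><td>0.72</td></tr>
          <tr><td>Earth</td><td>1.00</td></tr>
        </tbody>
      </table>
      <!-- Layout belongs in divs and spans positioned with CSS, not in tables -->
      <div class="sidebar">Related reading</div>
    </article>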

Jason Barnard: Okay, none of that. So — WordPress tends to be structured more or less the same way. That must really help you. Whereas when I code myself, it never works out the way I intend.

Fabrice Canel: What you need to understand is that search engines these days are machine-learning based. Machine learning is about judgment — we look at a lot of content, tag it, and define what perfect tagging should look like. There is a variety of pages on the internet, some well structured, some not. If you are completely outside the norm — doing something really random — machine learning will be optimised for the common things, and if you are positioning things in completely random ways, you are not producing what machine learning expects. So a big advantage for anyone is to stick to standardisation, and WordPress is a great example of that.

Jason Barnard: One argument I have quite a lot: some people say WordPress is rubbish, but you’re saying that at least it’s standardised. So the disadvantages — maybe site speed, maybe less flexibility — are offset by the fact that it’s easy for you to read and digest.

Fabrice Canel: WordPress is a content management system with many templates you can use. If you use the common templates, search engines will understand the content because those are things we see a lot. But if you use a custom template that doesn’t follow the standard, that will be something we have difficulty understanding.

Jason Barnard: And might see as rubbish — especially if it’s got a long URL! Sorry, I’m going back to the first point. When I use WordPress I use the standard themes right out of the box, on the assumption that even if it doesn’t look great, at least you’ll understand it. But my problem is it doesn’t look that great.

Fabrice Canel: You can still optimise things to satisfy your user experience while also helping us.

Jason Barnard: So I’d better get to work on that. Okay — we’ve got crawling, we’ve got extracting. Extracting is incredibly difficult, and it’s machine-learning based. Is it getting better at an increasing rate? An exponential rate?

Fabrice Canel: Yes. Fifteen years ago, this was hard-coded rules in the system — look for a title tag, description, meta tags, the kind of thing a junior developer loves to code. These days it’s machine learning. We render pages deeply and extract information from them. I want to highlight that Bing is moving to new technology to understand pages. We have adopted Microsoft Edge as our rendering engine — rendering pages visually, handling them dynamically, executing JavaScript and applying stylesheets. The new Microsoft Edge is based on Chromium, the same technology used by Googlebot and Chrome. We are also collaborating with Google to improve Chromium — to make it better not only for Edge but also for Bing — because we were having some issues on certain sites, and so we are working together to improve things overall.

Jason Barnard: Something people maybe don’t realise is that Bing and Google actually collaborate on some things.

Fabrice Canel: We collaborate on that, we collaborate also on other standards — the robots exclusion protocol and other things. We have various collaborations, because if search engines each have their own standards, that doesn’t make any sense.

Jason Barnard: Webmasters shouldn’t have to code separately for Bing and for Google. I remember when we created a whole HTML page for each keyword variation for each and every search engine — that was really tedious. I’m much happier now just saying, here’s my content.

Fabrice Canel: Yes. This adoption of Edge as a rendering engine for Bing will make life easier for the SEO community, because you will only have to test once. If it renders correctly in Microsoft Edge, it renders correctly in Chrome, it renders correctly for Googlebot, it renders correctly for Bing. But with one caveat: please make sure you allow access to JavaScript and stylesheets in your robots.txt file.
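
In practice, the caveat simply means your robots.txt must not block the JavaScript and stylesheet files a page needs in order to render. A minimal sketch (the paths are hypothetical, adapt them to your own site):

    User-agent: *
    # Let crawlers fetch the resources needed to render the page
    Allow: /assets/js/
    Allow: /assets/css/
    # Only block what genuinely should not be crawled
    Disallow: /admin/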

Jason Barnard: Yeah, JavaScript rendering has become incredibly complex, but it’s getting really, really good.

Fabrice Canel: Yes — and on top of the rendering technology improving, machine learning is improving as well. We achieve very high quality these days. You can see it in Bing’s featured snippets — the Q&A on the SERPs — where we are now able to extract the meaning of a page: the title, headings, lists, and tables. We were not able to do this a few years ago. Now it’s becoming really, really good.

Jason Barnard: And you can extract a featured snippet from the middle of a page because you understood the structure. That leads me to the next question: you’ve extracted this information and put it in a database for the ranking team to pick up. Which means the way you present the data dramatically changes how the ranking team perceives it — and therefore the rankings. We all rely on you, basically.

Fabrice Canel: The Bing ranking team relies on my team to extract and present information they can leverage. For example, if a page is in Japanese but my stack incorrectly detected it as French or English, the ranking team will have difficulty retrieving it for Japanese queries. That’s just one example.

Jason Barnard: So when you put it in the database, you’re tagging it with the language, the type of content. You’ll take a chunk — an h2 with a paragraph of text — and say, this is an h2, it’s in French, and then the ranking team asks, okay, is it useful or not?

Fabrice Canel: Yes. The internet is HTML. We understand it, and we add a rich layer of annotation on top — a large number of extracted features — and we provide those annotations so that other teams can retrieve, display, and make use of this data.

Jason Barnard: So to paraphrase: if your annotation work is done badly, there’s nothing anyone else can do about it. And the way you annotate — I love that word — makes all the difference in the world.

Fabrice Canel: Yes, we have to do it right, because if we detect and classify things badly, nothing better can happen downstream. We will return another URL that may be less relevant than the correct one.

Jason Barnard: Okay. The blue link ranking team goes in and says, okay, this page has no description, here’s the content — is it relevant for this query? But the featured snippet team will be looking at different chunks that you’ve identified, extracted, and stored separately.

Fabrice Canel: That’s a good question — I can’t answer it fully. But ultimately, the whole of Bing relies on my team’s understanding and processing of the internet. We don’t have a separate discovery system, selection system, and processing system for featured snippets. Everything is combined — the various teams extract from my database all the information they need, and display useful results based on what I’ve processed.

Jason Barnard: I learned recently from Nils Rooijmans about using JSON-LD stored in a database — MySQL in that case, nothing to do with what you’re doing — but having this three-dimensional structure where you query, say, all annotated Q&As, or all annotated h2s with their paragraphs, and just pull those out for analysis.

Fabrice Canel: I can’t speak to what they’re doing specifically — they’ll tell you in the next videos — but my job stops at writing to this database: writing useful, richly annotated information, and handing it off for the ranking team to do their job.

Jason Barnard: Brilliant. Another question: to me, the crawling is incredibly difficult, the extracting is incredibly difficult, and storing in a database in a way that other teams can reliably query is also incredibly difficult. Three phenomenally different jobs.

Fabrice Canel: Yes.

Jason Barnard: For quality control — once you get to the scale you’re at, I won’t ask the exact number because I know you won’t tell me, but I’m going to say a hundred billion for everyone listening, and that’s my number not his — how can you possibly do quality control at that scale?

Fabrice Canel: Quality control starts with auditing the data itself. A simple example: dead links, where the HTTP response returns a 404. That information comes directly from the website.

Jason Barnard: So put a 410 in and let’s go!

Fabrice Canel: I prefer that. It’s a shame, though, because what we see is that people make mistakes implementing HTTP status codes. We see what we call soft 404s — pages that return a 200 response but are effectively dead links.
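
For anyone who has not met the term, a soft 404 is a page whose content says "not found" while the HTTP response still claims success. Roughly (illustrative responses, not taken from Bing):

    # A soft 404: the status code contradicts the content
    HTTP/1.1 200 OK
    Content-Type: text/html

    <h1>Sorry, this page no longer exists</h1>

    # What a crawler would rather receive for a page that is really gone
    HTTP/1.1 404 Not Found
    # ...or, if it was removed deliberately and will not return
    HTTP/1.1 410 Gone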

Jason Barnard: Even if you put a 404 or 410, we genuinely don’t know if you’ll publish again on that URL later. So is it really gone?

Fabrice Canel: Yes, at the end we treat all information provided on the internet as a hint telling us what’s good or bad. We don’t take it as absolute truth.

Jason Barnard: Sorry to jump in there, but the word “hint” just came up — we had the nofollow debate, where Google said nofollow was “just a hint.” I always thought it was taken as gospel, but they’ve always been saying it’s just a hint. So is everything just a hint and you make your own mind up?

Fabrice Canel: There are things we do not treat as a hint. An example is the robots exclusion protocol — a webmaster can specify “noindex.” We take that as a definitive statement. Noindex means we do not index.

Jason Barnard: So nofollow is a hint, but noindex is definitive — you won’t index that page.

Fabrice Canel: We will not index a page marked noindex. But for nofollow — you see, we have to fetch the page first in order to discover the noindex tag.
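
For reference, the noindex directive Fabrice describes is usually declared in one of two ways (generic examples, not Bing-specific), and, as he points out, the crawler has to fetch the page before it can see either of them:

    <!-- In the page's <head>: do not index this page -->
    <meta name="robots" content="noindex">

    <!-- Or as an HTTP response header, useful for non-HTML files -->
    X-Robots-Tag: noindex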

Jason Barnard: Okay, so we’ve actually gone back to the beginning — crawling and trying to figure out all the content — because it’s all completely interrelated. You can’t separate the three big jobs.

Fabrice Canel: Discovery is the first and most fundamental step. I need to discover these URLs, I need to select them. Guide me to good quality URLs, help me to fetch the content, and help me to process it — because there are four steps which…

Jason Barnard: Alright, my little summary failed there because discovery comes before everything else.

Fabrice Canel: I don’t know what is published on your site. Tell me. Guide me to your content.

Jason Barnard: So you’re asking us for help with signposts — links, sitemaps, RSS feeds, or APIs.

Fabrice Canel: Guide me to the content.
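
Those signposts can be as simple as an XML sitemap. A minimal sketch (example.com, the paths and the dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2020-01-15</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/blog/latest-article/</loc>
        <lastmod>2020-01-10</lastmod>
      </url>
    </urlset>

Point search engines at it with a Sitemap: https://www.example.com/sitemap.xml line in robots.txt, or submit it directly in Bing Webmaster Tools.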

Jason Barnard: Fabrice, thank you very much. A quick goodbye to end the show — thank you, Fabrice, that was brilliant.

