Video transcription
Here’s a fun, almost math-problem-like question from HalfDeck in Davis, California:
If Google crawls a thousand pages a day, GoogleBot crawling many dupe content pages might slow down the indexing of a large site. In that scenario, do you recommend blocking dupes using robots.txt, or is using meta robots noindex, nofollow a better alternative?
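For reference, the meta robots alternative the question mentions is a tag placed in each duplicate page’s <head>; a minimal sketch of the standard tag:

```html
<!-- GoogleBot can still crawl this page, but is asked not to index it
     or follow the links on it -->
<meta name="robots" content="noindex, nofollow">
```

The key contrast with robots.txt, which comes up in the answer below, is that a noindexed page still gets crawled, so Google can still see what’s on it.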
Uff… it’s an interesting question.
I believe if you were to talk to the crawl and indexing team, they would normally say: look, let us crawl all the content, and we’ll figure out which parts of the site are dupes, which subtrees are dupes, and we’ll combine them together.
Whereas if you block something with your robots.txt, you can’t tell our crawler anything about it, so we can never see that there is a dupe. Then you can have an uncrawled URL coming up; sometimes you see these uncrawled URL references, where we saw the URL but weren’t able to crawl it, so we couldn’t see that it was a dupe.
So, I think the crawling and indexing guys would probably say: just go ahead and let us crawl these dupes, rather than blocking them with robots.txt.
Now, if you’ve got an incredibly weird site where you’ve got 16 copies of different things, I could imagine blocking some of that with robots.txt, just so that we don’t crawl it multiple times in lots of different ways.
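If you did go that last-resort route, the block itself is short; here is a minimal robots.txt sketch, with purely hypothetical paths standing in for the duplicated subtrees:

```
User-agent: *
# Hypothetical duplicated subtrees; GoogleBot will never fetch these,
# so it can never discover on its own that they are dupes
Disallow: /print/
Disallow: /mirror/
```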
But before I did that, I would really try to let Google crawl the pages and see if we can figure out the dupes on our own.
I would also look at whether you can re-architect your site, and, worst case, I’d look at whether you could handle URL parameters in a way where you can tell Google, with our new tool in the webmaster console: hey, this parameter doesn’t matter, this parameter is a session ID, this parameter doesn’t matter. Because there are often a lot of ways you can set that up, and we can collapse that down, so you can help Google have the information it needs in order to combine them.
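As a rough illustration of what that parameter handling amounts to, here is a minimal Python sketch that collapses URLs differing only in ignorable parameters; the parameter names (sessionid, ref) are hypothetical placeholders for whatever a given site actually uses:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Parameters that don't change the page content (hypothetical names)
IGNORED_PARAMS = {"sessionid", "ref"}

def canonicalize(url: str) -> str:
    """Collapse URLs that differ only in ignorable query parameters."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in IGNORED_PARAMS]
    # Sort the remaining parameters so ordering differences also collapse
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

# Both variants collapse to https://example.com/item?id=5
print(canonicalize("https://example.com/item?id=5&sessionid=abc123"))
print(canonicalize("https://example.com/item?sessionid=xyz789&id=5"))
```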
So, I would typically not recommend blocking dupes using robots.txt; that would be the last resort.
I would explore all the other ways of fixing your site architecture and letting Google figure it out on its own before I would even consider going to that final step.
Quick Answer: Let Google figure out the duplicates on its own before resorting to blocking them with robots.txt