Forum Moderators: Robert Charlton & goodroi


Need help handling URL/post variations (embed and download versions of an article)

         

busos

4:51 am on Aug 5, 2023 (gmt 0)



I have an article post URL with several variations: one embeds the full article as a PDF in an iframe, and another is a page for downloading the entire article in PDF format.

The original post page is actually just a summary of a lot of information, such as citation, download statistics, and article excerpts.

https://website/index.php/category/article/view/75417
https://website/index.php/category/article/view/75417/40283
https://website/index.php/category/article/download/75417/40283


In reality, these three pages are just variations of the same page (I think?).

I am confused about what to do with these page variations. Should I use rel canonical or something else?

ChatGPT suggested that I noindex the embed and download pages instead.

All of these URLs appear in search results, but I don't want the embed (iframe) and PDF download pages to show up there. I only want this URL to be displayed:
https://website/index.php/category/article/view/75417


I am asking for advice and input from all my friends here. What is the best approach I can apply?

Thank you.

not2easy

12:00 pm on Aug 5, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hi busos and welcome to WebmasterWorld [webmasterworld.com]

You should first check that each of those pages is reachable at *only* one URL. For example, can you view
https://example/index.php/category/article/view/75417
only at that URL or is it also viewable as
https://example/category/article/view/75417
?

Normally you would need to use something like https://example.com/index.php/category/article/view/75417 or https://example.net/index.php/category/article/view/75417

I am trying to understand the site structure, since the URLs show no actual domain TLD extension, so that I can suggest what steps might help.
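The check described above (does an alternate address serve the page itself, or does it redirect to the preferred one?) could be scripted roughly as follows. This is only a sketch using Python's standard library; the function names and the `opener` parameter are illustrative and not part of any tool mentioned in this thread.

```python
from urllib.request import urlopen


def final_url(url, opener=urlopen):
    """Return the URL that actually served the response, after any redirects."""
    with opener(url) as response:
        return response.geturl()


def alternates_redirect_to(preferred, alternates, opener=urlopen):
    """Map each alternate URL to True if it ends up at the preferred URL."""
    return {alt: final_url(alt, opener) == preferred for alt in alternates}
```

An alternate that maps to False is served at its own address, which means the same page is reachable at more than one URL.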

lucy24

5:40 pm on Aug 5, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



https://website/index.php/category/article/view/75417
https://website/index.php/category/article/view/75417/40283
https://website/index.php/category/article/download/75417/40283
Assuming, first, that the one word “website” is a stand-in for “example.com” ...

Is it exactly these three patterns, or are there others? I assume the consistent 75417 is associated with the actual file being linked or embedded. But have you any idea where the 40283 comes from, and is it always that exact number or is it random?

The first thing you should do is set up a redirect, so that requests for

index.php/category/article/view/75417.+ (more stuff after the number)
and
index.php/category/article/download/75417.* (extra “download”, with or without more stuff after the number)
get redirected to
https://example.com/index.php/category/article/view/75417

If you do not know how to do this, ask in the Apache subforum (or IIS or whatever server type you are on). Is this a long-established site with URLs in a long-established format? Ordinarily you don't want anything except possibly a query string to come after “index.php”; in fact ordinarily you wouldn’t want “index.php” to be part of the visible URL at all. But I am not suggesting you go back and revamp your entire URL structure just to get rid of a single annoyance.
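As a rough illustration of the redirect mapping described above, here is a Python sketch of the logic only; it is not server configuration. The regular expression is an assumed generalisation of the two patterns (any category segment, any numeric ID), and `redirect_target` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed generalisation of the two patterns above: any category segment,
# "view" or "download", a numeric ID, then optional extra path segments.
ARTICLE_PATH = re.compile(
    r"^(/index\.php/[^/]+/article/)(?:view|download)/(\d+)(?:/.*)?$"
)


def redirect_target(url):
    """Return the summary-page URL to redirect to, or None if no redirect applies."""
    parts = urlsplit(url)
    match = ARTICLE_PATH.match(parts.path)
    if match is None:
        return None  # not an article URL at all
    summary_path = f"{match.group(1)}view/{match.group(2)}"
    if summary_path == parts.path:
        return None  # already the summary page
    # Query string and fragment are dropped in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, summary_path, "", ""))
```

Whether a redirect is appropriate depends on what else requests those URLs: if a viewer page loads the PDF from the download address, redirecting that address would also affect the viewer.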

busos

8:54 am on Aug 6, 2023 (gmt 0)



@not2easy

hi not2easy

index.php is not the issue, because a request to the URL
https://example/category/article/view/75417
will be redirected to
https://example/index.php/category/article/view/75417

The problem is that within 1 post, there are 2 versions as I explained above:

1. Embed the full post PDF:
https://example.com/index.php/category/article/view/75417/40283

2. Download the full article in PDF format:
https://example.com/index.php/category/article/download/75417/40283


The number 75417 represents the post ID, and 40283 represents the file ID.

Actually, there are many other variations, for example:

https://example.com/plugins/generic/pdfJsViewer/pdf.js/web/viewer.html?file=https://example.com/index.php/category/article/download/75417/40283

https://example.com/index.php/category/article/view/75417/40283?related_post_from


However, it seems that the root of the problem lies in the URL pattern I mentioned above.

Here are some examples:

1. https://example.bc.edu/index.php/jtla/article/view/1603 = main post
2. https://example.bc.edu/index.php/jtla/article/view/1603/1455 = embed full article
3. https://example.bc.edu/index.php/jtla/article/download/1603/1455/1738 = download full article
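The URL variations listed in this post can be told apart mechanically. The following Python sketch classifies a URL as summary, embed, download, or viewer and extracts the post ID and file ID. The function name, the dictionary keys, and the regular expression are assumptions made for illustration. The meaning of a third number (as in the last example) is not stated in the thread, so it is kept as an unnamed extra.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumed pattern: /index.php/<journal>/article/<view|download>/<id>[/<id>...]
ARTICLE_PATH = re.compile(
    r"^/index\.php/(?P<journal>[^/]+)/article/"
    r"(?P<op>view|download)/(?P<ids>\d+(?:/\d+)*)/?$"
)


def classify(url):
    """Return a dict describing the article URL, or None if it is not one."""
    parts = urlsplit(url)
    if parts.path.endswith("/pdf.js/web/viewer.html"):
        # The viewer wraps a download URL in its "file" query parameter.
        wrapped = parse_qs(parts.query).get("file")
        inner = classify(wrapped[0]) if wrapped else None
        return None if inner is None else {**inner, "kind": "viewer"}
    match = ARTICLE_PATH.match(parts.path)
    if match is None:
        return None
    ids = [int(n) for n in match.group("ids").split("/")]
    if match.group("op") == "download":
        kind = "download"
    elif len(ids) == 1:
        kind = "summary"
    else:
        kind = "embed"
    return {
        "kind": kind,
        "journal": match.group("journal"),
        "post_id": ids[0],
        "file_id": ids[1] if len(ids) > 1 else None,
        "extra_ids": ids[2:],  # meaning not stated in the thread
    }
```

A query string such as `?related_post_from` does not change the result, because only the path is matched.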



[edited by: not2easy at 11:06 am (utc) on Aug 6, 2023]
[edit reason] Please see TOS [webmasterworld.com] [/edit]

not2easy

11:25 am on Aug 6, 2023 (gmt 0)




I apologize for the severe edits, but the URL format used in your original post was confusing, and from this more recent post we can see URLs typical of WordPress.

This is the reason lucy24 asked
Assuming, first, that the one word “website” is a stand-in for “example.com”

If this is a WP site, you would set up your canonicals in your SEO plugin. Because of the way WP generates URL versions, a good SEO plugin is the way to keep one version indexed and the other versions unlisted.

The plugin you are using to view the pdf (plugins/generic/pdfJsViewer/pdf.js/web/viewer.html ) should have its own settings to handle indexing.

busos

4:11 pm on Aug 6, 2023 (gmt 0)



I apologize in advance if I have caused any confusion. I attempted to edit my initial post to add information, but I didn't know how to do it.

Let me start over.

The content management system (CMS) I am using is not WordPress; it's OJS (Open Journal Systems).

In this CMS, whenever we publish an article, be it research, reviews, or regular articles, OJS generates three types of pages (as mentioned above), each with its own URL pattern:

1. Summary or initial page:

https://website/index.php/category/article/view/75417


This page contains only a summary, such as download statistics, author details, an excerpt from the full article (250 - 500 words), a disclaimer about the article, and other information. There is a link leading to the second page, the "embed full article" page.

2. Embed full article page:

https://website/index.php/category/article/view/75417/40283


This page displays an embedded iframe of the full article, allowing visitors to read the article through their web browsers. There is a link leading to the third page, the "download full article" page.

3. Download full article page:

https://website/index.php/category/article/download/75417/40283


This page permits visitors to download the full article in PDF format to their hard drives.


Each of these pages generally has a problem called "Duplicate without user-selected canonical" because there are no meta tags like canonical, as I mentioned in my previous post.

So, my question is, what should I do regarding these variations of URLs I mentioned above?

And yes, "website" means example.com.


I hope I haven't confused my fellow friends at WebmasterWorld again.

not2easy

5:24 pm on Aug 6, 2023 (gmt 0)




OK, my confusion about the CMS was due to the /plugins/ directory, which is typical of WordPress. It does help to know that it is not WP but OJS, which I have not used. Please disregard the suggestion to deal with canonicals via the common WP SEO plugin.

I looked up some information about the OJS CMS and I see that they offer an "OJS 3.x Article Metadata Plugin" that allows you to add metadata to articles. That might be an option for adding canonical metadata. Because I also noticed that they are primarily OJS hosted, you may need to use their tools to do what you want as far as indexing goes. I am guessing you do not control your .htaccess file, which could allow you to noindex files within a given directory? Because it is open-source software, there may be other suppliers of such plugin tools, but they do not make it easy to determine how a person can evaluate such suppliers. When the pages are generated from a database, it limits your ability to modify page versions.

busos

2:43 am on Aug 7, 2023 (gmt 0)



Yes, as far as I know, this CMS is based on Bootstrap, and it automatically adds meta-data for each published article according to its type or category.

Adding additional meta tags is not a problem, as it is relatively easy to do. Now, returning to my initial question, what would be the best course of action regarding these variations of posts?

The embedded full-article and download pages rank quite well in search results, but these pages are in PDF format.

Should I consider adding a rel canonical tag to the embedded and download pages that point to the summary/initial page? Or should I add a noindex tag to the embedded and download pages, etc.?
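For reference, the two options being weighed here look roughly like this. An HTML page such as the embed page can carry a canonical link element or a robots meta tag in its head. A URL that returns the PDF file itself cannot carry HTML tags, so the equivalent signals would have to be sent as HTTP response headers (`Link` with `rel="canonical"`, or `X-Robots-Tag`). The Python helper names below are hypothetical, and how such tags or headers would be added in OJS is not covered in this thread.

```python
from html import escape


def canonical_link_tag(summary_url):
    """Option 1 for an HTML page: point search engines at the summary page."""
    return f'<link rel="canonical" href="{escape(summary_url, quote=True)}">'


# Option 2 for an HTML page: ask search engines not to index it.
NOINDEX_META_TAG = '<meta name="robots" content="noindex">'


def canonical_headers(summary_url):
    """Option 1 for a non-HTML response such as a PDF download."""
    return {"Link": f'<{summary_url}>; rel="canonical"'}


# Option 2 for a non-HTML response such as a PDF download.
NOINDEX_HEADERS = {"X-Robots-Tag": "noindex"}
```

For example, `canonical_link_tag("https://example.com/index.php/category/article/view/75417")` produces the tag that would go in the head of the embed page.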

tangor

3:42 am on Aug 7, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Unless your "view pdf" page (step two) PREVENTS the user from right click save as, once the full pdf is displayed (in step two) it can be downloaded by any browser... Why a third page just to download?

busos

8:09 am on Aug 7, 2023 (gmt 0)



Unless your "view pdf" page (step two) PREVENTS the user from right click save as, once the full pdf is displayed (in step two) it can be downloaded by any browser... Why a third page just to download?


On the PDF viewing page, we have the option to click and select "Save As". However, the saved file is the HTML source code of the PDF viewing page itself, not the actual PDF.

Below is the file saved in HTML format:
<!DOCTYPE html>
<html lang="en-US" xml:lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>book review</title>


<link rel="icon" href="https://example.com/public/journals/73/favicon_en_US.jpg">
<meta name="generator" content="Open Journal Systems 3.0.2.0">
<link rel="stylesheet" href="https://example.com/public/journals/73/styleSheet.css" type="text/css" /><link rel="stylesheet" href="https://example.com/index.php/category/$$$call$$$/page/page/css?name=bootstrap" type="text/css" /><link rel="stylesheet" href="https://example.com/index.php/category/$$$call$$$/page/page/css?name=bootstrapTheme-journal" type="text/css" /><link rel="stylesheet" href="https://example.com/plugins/generic/orcidProfile/css/orcidProfile.css" type="text/css" />
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.js" type="text/javascript"></script><script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.11.0/jquery-ui.js" type="text/javascript"></script><script src="https://example.com/lib/pkp/js/lib/jquery/plugins/jquery.tag-it.js" type="text/javascript"></script><script src="https://example.com/plugins/themes/bootstrap3/bootstrap/js/bootstrap.min.js" type="text/javascript"></script><script src="https://example.com/js/plugins/citationFormats.js" type="text/javascript"></script><script type="text/javascript">var _paq = _paq || [];

</head>
<body class="pkp_page_article pkp_op_view">

<header class="header_view">

<a href="https://example.com/index.php/category/article/view/71582" class="return">
<span class="pkp_screen_reader">
Return to Article Details
</span>
</a>

<a href="https://example.com/index.php/category/article/view/71582" class="title">
book review
</a>

<a href="https://example.com/index.php/category/article/download/71582/50270/" class="download" download>
<span class="label">
Download
</span>
<span class="pkp_screen_reader">
Download PDF
</span>
</a>

</header>

<script type="text/javascript" src="https://example.com/plugins/generic/pdfJsViewer/pdf.js/build/pdf.js"></script>
<script type="text/javascript">

$(document).ready(function() {
PDFJS.workerSrc='https://example.com/plugins/generic/pdfJsViewer/pdf.js/build/pdf.worker.js';
PDFJS.getDocument('https://example.com/index.php/category/article/download/71582/50270/').then(function(pdf) {
// Using promise to fetch the page
pdf.getPage(1).then(function(page) {
var pdfCanvasContainer = $('#pdfCanvasContainer');
var canvas = document.getElementById('pdfCanvas');
canvas.height = pdfCanvasContainer.height();
canvas.width = pdfCanvasContainer.width()-2; // 1px border each side
var viewport = page.getViewport(canvas.width / page.getViewport(1.0).width);
var context = canvas.getContext('2d');
var renderContext = {
canvasContext: context,
viewport: viewport
};
page.render(renderContext);
});
});
});

</script>
<script type="text/javascript" src="https://example.com/plugins/generic/pdfJsViewer/pdf.js/web/viewer.js"></script>

<div id="pdfCanvasContainer">
<iframe src="https://example.com/plugins/generic/pdfJsViewer/pdf.js/web/viewer.html?file=https%3A%2F%2Fexample.com%2Findex.php%2Fcategory%2Farticle%2Fdownload%2F71582%2F50270%2F" width="100%" height="100%" style="min-height: 500px;" allowfullscreen webkitallowfullscreen></iframe>
</div>

</body>
</html>


The purpose of this page, in my opinion, is to allow visitors to read articles in full through their web browsers online. On the other hand, the purpose of the PDF download page is to enable visitors to download articles in PDF format for offline reading.

not2easy

11:12 am on Aug 7, 2023 (gmt 0)




If you want only https://website/index.php/category/article/view/75417 to be indexed, add canonical metatags to the other formats and noindex the other formats. If you submit a sitemap, remove the noindex versions from your sitemap and submit only the version you prefer to index.
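The sitemap step can be sketched as a filter that keeps only the summary-page URLs and writes them out in the standard sitemap XML format. This Python sketch assumes the URL pattern shown earlier in the thread; the function names are illustrative, and OJS may generate its sitemap in a different way.

```python
import re
from xml.sax.saxutils import escape

# Assumed pattern for the summary page: .../article/view/<post ID> and nothing after it.
SUMMARY_URL = re.compile(r"^https?://[^/]+/index\.php/[^/]+/article/view/\d+$")


def summary_urls(urls):
    """Keep only summary-page URLs, in order, without duplicates."""
    kept = []
    for url in urls:
        if SUMMARY_URL.match(url) and url not in kept:
            kept.append(url)
    return kept


def render_sitemap(urls):
    """Render a minimal sitemap containing only the summary-page URLs."""
    entries = "\n".join(
        f"  <url><loc>{escape(url)}</loc></url>" for url in summary_urls(urls)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
```

Passing all three URL variations for one article to `render_sitemap` yields a sitemap that lists only the `view/75417` address.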

BTW, most browsers offer a "Print" option and allow you to save as a pdf file.

tangor

12:07 pm on Aug 7, 2023 (gmt 0)




The purpose of this page, in my opinion, is to allow visitors to read articles in full through their web browsers online. On the other hand, the purpose of the PDF download page is to enable visitors to download articles in PDF format for offline reading.


If I can see the PDF, and want it, I can get it---html or no.

Nothing wrong with how it is being done, just seems to be extra steps that introduce multiple urls that point to the same desired content (the pdf).