
How to scrape PDF from CNKI #116

Open
lullabymia opened this issue Nov 30, 2018 · 14 comments
Labels
discussion Extra attention is needed

Comments

@lullabymia

lullabymia commented Nov 30, 2018

Troubleshooting

Describe your environment

  • Operating system: macOS
  • Python version: 3.7
  • Hardware: MacBook
  • Internet access:
  • Jupyter notebook or not? [Y/N]: Y
  • Which chapter of book?: chapter 6

Describe your question

I want to scrape the full PDF text of People's Daily from CNKI but have no idea how to do it. Do I need to download all the articles first?


Describe the efforts you have spent on this issue

I found this article about PDF scraping:
https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
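
For reference, the core of that article's approach is a single call to textract (a minimal sketch, assuming textract and its system dependencies install cleanly; sample.pdf is a placeholder filename):

import textract

# textract.process returns the extracted text as bytes
text = textract.process('sample.pdf')
print(text.decode('utf-8'))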

@hupili
Owner

hupili commented Nov 30, 2018

@lullabymia , does the Medium article work? It seems to include a complete example of extracting text from PDFs.

@lullabymia
Author

lullabymia commented Dec 1, 2018

I followed the instructions in the article, but I could not install textract at the first step.

@hupili

@ChicoXYC
Collaborator

ChicoXYC commented Dec 1, 2018

@lullabymia How many files do you have? Maybe you can send them to me and I will help you run OCR with tools like filereader OCR.

@lullabymia
Author

We might need to scrape thousands of PDF files from this website http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB (at least from Jan 1, 2008 to June 30, 2008).
Do I need to download all the PDFs before scraping?
@ChicoXYC

@ChicoXYC
Collaborator

ChicoXYC commented Dec 1, 2018

@lullabymia Yes, you first need to get the PDFs from the website. The following is an example of extracting words from a PDF:
https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/extract-words-with-pdf.ipynb
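
The gist of the linked notebook, as a minimal sketch (assuming PyPDF2 is installed and sample.pdf is a local file):

import PyPDF2

# open the PDF in binary mode and print the text of each page
with open('sample.pdf', 'rb') as f:
    pdfReader = PyPDF2.PdfFileReader(f)
    for i in range(pdfReader.numPages):
        page = pdfReader.getPage(i)
        print(page.extractText())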

@ChicoXYC
Collaborator

ChicoXYC commented Dec 1, 2018

I also tried the method mentioned in the Medium article above, but cannot get past installing the module for now. Will try again later. https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/failure-extract-pdf-textract.ipynb

@hupili
Owner

hupili commented Dec 3, 2018

Don't dwell on textract. PyPDF2 alone already works, as shown in this comment.

@lullabymia
Author

lullabymia commented Dec 10, 2018

But there is an error while reading the PDFs:
Multiple definitions in dictionary at byte 0x7eb1 for key /MediaBox
@ChicoXYC
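
If that message comes from PyPDF2's strict parsing mode, one thing worth trying is to open the reader with strict=False, which downgrades such malformed-dictionary problems to warnings instead of errors (a sketch, not verified against these particular files; sample.pdf is a placeholder):

import PyPDF2

with open('sample.pdf', 'rb') as f:
    # strict=False makes PyPDF2 warn instead of failing on duplicate dictionary keys
    pdfReader = PyPDF2.PdfFileReader(f, strict=False)
    for i in range(pdfReader.numPages):
        print(pdfReader.getPage(i).extractText())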

@lullabymia
Author

Besides the above reading error, I cannot click the "download PDF" link on the webpage
(and it seems that when I paste the link directly, it downloads a .caj file instead of a PDF):
http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!

import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
html=browser.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
links = soup.find('div',attrs={'class':"dllink"})
link = links.find('a',attrs={'class':"icon icon-dlpdf"})
link.click()

@ChicoXYC Can you also help me with that?

@ChicoXYC
Collaborator

@lullabymia You need to locate the element with selenium itself rather than with BeautifulSoup, so that the returned element can actually be clicked. The following should help:

import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)

# find the "download PDF" link as a live element in the browser, then click it
link = browser.find_element_by_css_selector('.dllink a.icon.icon-dlpdf')
link.click()
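
If Chrome opens the file in its built-in viewer or saves it somewhere unexpected instead of downloading the PDF, it may help to start the driver with download preferences set. This is a sketch under that assumption; the download directory path is a placeholder:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    # save downloads to this folder without prompting
    'download.default_directory': '/path/to/pdfs',
    'download.prompt_for_download': False,
    # download PDFs instead of opening them in Chrome's viewer
    'plugins.always_open_pdf_externally': True,
})
browser = webdriver.Chrome(chrome_options=options)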

@ChicoXYC
Collaborator

@lullabymia

import PyPDF2
import os

path = 'pdfs/'   # folder holding your PDF files; easiest to put it next to your Jupyter notebooks
for file in os.listdir(path):
    pdfFileObject = open(os.path.join(path, file), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())

@lullabymia
Author

Clicking the PDF link worked, thank you! But the error "Multiple definitions in dictionary at byte 0xcd34 for key /MediaBox" appeared again while reading the PDF files. :(

@ChicoXYC

@ChicoXYC
Collaborator

@lullabymia

  1. Check whether only PDF files are in the folder (the sketch below skips anything else).
  2. Please don't leave blank spaces in the folder name.
  3. If it still doesn't work, you can find me in CVA 808 and we can work it out face to face.
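
A sketch of point 1 combined with the earlier reading loop, assuming the folder may contain stray non-PDF files (such as .caj downloads) and that the /MediaBox message comes from PyPDF2's strict mode:

import PyPDF2
import os

path = 'pdfs/'
for file in os.listdir(path):
    # skip anything that is not a PDF, e.g. .caj downloads or hidden files
    if not file.lower().endswith('.pdf'):
        continue
    with open(os.path.join(path, file), 'rb') as pdfFileObject:
        # strict=False downgrades problems like duplicate /MediaBox keys to warnings
        pdfReader = PyPDF2.PdfFileReader(pdfFileObject, strict=False)
        for i in range(pdfReader.numPages):
            print(pdfReader.getPage(i).extractText())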

@hupili
Owner

hupili commented Dec 14, 2018

redirect to #135

@ChicoXYC ChicoXYC added the discussion Extra attention is needed label Jan 15, 2019