Creating a search engine in Python language

Page 1/8
Goal: Implement a complete search engine. Milestones Overview
Milestone Goal #1 Produce an initial index for the corpus and a basic retrieval component
#2 Complete Search System
Page 2/8
PROJECT: SEARCH ENGINE Corpus: all ICS web pages We will provide you with the crawled data as a zip file (webpages_raw.zip). This contains the downloaded content of the ICS web pages that were crawled by a previous quarter. You are expected to build your search engine index off of this data. Main challenges: Full HTML parsing, File/DB handling, handling user input (either using command line or desktop GUI application or web interface) COMPONENT 1 – INDEX: Create an inverted index for all the corpus given to you. You can either use a database to store your index (MongoDB, Redis, memcached are some examples) or you can store the index in a file. You are free to choose an approach here. The index should store more than just a simple list of documents where the token occurs. At the very least, your index should store the TF-IDF of every term/document. Sample Index:
Note: This is a simplistic example provided for your understanding. Please do not consider this as the expected index format. A good inverted index will store more information than this. Index Structure: token – docId1, tf-idf1 ; docId2, tf-idf2
Example: informatics – doc_1, 5 ; doc_2, 10 ; doc_3, 7 You are encouraged to come up with heuristics that make sense and will help in retrieving relevant search results. For e.g. – words in bold and in heading (h1, h2, h3) could be treated as more important than the other words. These are useful metadata that could be added to your inverted index data. Optional (1 point for each meta data item up to 2 points max):: Extra credit will be given for ideas that improve the quality of the retrieval, so you may add more metadata to your index, if you think it will help improve the quality of the retrieval. For this, instead of storing a simple TF-IDF count for every page, you can store more information related to the page (e.g. position of the words in the page). To store this information, you need to design your index in such a way that it can store and retrieve all this metadata efficiently. Your index lookup during search should not be horribly slow, so pay attention to the structure of your index COMPONENT 2 – SEARCH AND RETRIEVE: Your program should prompt the user for a query. This doesn’t need to be a Web interface, it can be a console prompt. At the time of the query, your program will look up your index, perform some calculations (see ranking below) and give out the ranked list of pages that are relevant for the query.  
COMPONENT 3 – RANKING:
At the very least, your ranking formula should include tf-idf scoring, but you should feel free to add additional components to this formula if you think they improve the retrieval. Optional (1 point for each parameter up to 2 points max): Extra credit will be given if your ranking formula includes parameters to improve ranking other than tf-idf from the techniques discussed in class.
Milestone #1 (15 points) Goal: Build an index and a basic retrieval component By basic retrieval component; we mean that at this point you just need to be able to query your index for links (The query can be as simple as single word at this point). These links do not need to be accurate/ranked. We will cover ranking in the next milestone. At least the following queries should be used to test your retrieval: 1 – Informatics 2 – Mondego 3 – Irvine 4 – artificial intelligence 5 – computer science Note: query 4 and 5 are for milestone #2 Deliverables: Submit a report (pdf) in Canvas with the following content:
1. A table with assorted numbers pertaining to your index. It should have, at least the number of documents, the number of [unique] words, and the total size (in KB) of your index on disk.
2. Number of URLs retrieved for each of the queries above and listing the first 20 URLS for each query
Evaluation criteria:
_ Was the report submitted on time?
 _ Are the reported numbers plausible?
 _ Are the reported URLs plausible?
Milestone #2 (45 points and 6 pts Goal: complete search engine Deliverables:
_ Submit a zip file containing all the artifacts/programs you wrote for your search _ A live demonstration of your search engine
Evaluation criteria:
– Does your program work as expected of search engines? – How general are the heuristics that you employed to improve the retrieval? – How complete is the UI? (e.g. links to the actual pages, snippets, etc.)
Page 4/8
– Do you demonstrate in-depth knowledge of how your search engine works? Are you able to answer detailed questions pertaining to any aspect of its implementation?
Additional Information: Understanding the data dump: In Assignment-2, crawlers of all the groups collectively crawled 37,497 URLs. We collected these URLS and are providing them to you as ‘webpages_clean.zip’ file. This zip file contains the following:
1. bookkeeping.json 2. bookkeeping.tsv 3. Folders 0 to 74
Folders: The 37,497 URLs are organized into 75 folders, each folder having 500 files. Every file has the extracted HTML source code of a particular URL. Bookeeping files: bookkeeping.json and bookkeeping.tsv are two different formats of the same file. These files maintain a list of all the URLs that have been crawled. Every URL has an identifier associated with it. This identifier helps locate the HTML code of the URL. The identifier is of the format: “folder_number/file_number” For example, consider the entry on line 13 of bookkeeping.json:
“0/108”: “vision.ics.uci.edu/papers/RamananBK_ICCV_2007”
This means that the HTML code extracted for the link “vision.ics.uci.edu/papers/Ramanan BK_ICCV_2007” is located at folder 0, file number 108. Broken HTML: The HTML source code of the URLs may not be well formed. This means that the code may not necessarily have a pair of opening and closing tags. For example, there might be an open

Custom Paper Help
Calculate your paper price
Pages (550 words)
Approximate price: -

Why Work with Us

Top Quality and Well-Researched Papers

We always make sure that writers follow all your instructions precisely. You can choose your academic level: high school, college/university or professional, and we will assign a writer who has a respective degree.

Professional and Experienced Academic Writers

We have a team of professional writers with experience in academic and business writing. Many are native speakers and able to perform any task for which you need help.

Free Unlimited Revisions

If you think we missed something, send your order for a free revision. You have 10 days to submit the order for review after you have received the final document. You can do this yourself after logging into your personal account or by contacting our support.

Prompt Delivery and 100% Money-Back-Guarantee

All papers are always delivered on time. In case we need more time to master your paper, we may contact you regarding the deadline extension. In case you cannot provide us with more time, a 100% refund is guaranteed.

Original & Confidential

We use several writing tools checks to ensure that all documents you receive are free from plagiarism. Our editors carefully review all quotations in the text. We also promise maximum confidentiality in all of our services.

24/7 Customer Support

Our support agents are available 24 hours a day 7 days a week and committed to providing you with the best customer experience. Get in touch whenever you need any assistance.

Try it now!

Calculate the price of your order

Total price:
$0.00

How it works?

Follow these simple steps to get your paper done

Place your order

Fill in the order form and provide all details of your assignment.

Proceed with the payment

Choose the payment system that suits you most.

Receive the final file

Once your paper is ready, we will email it to you.

Our Services

No need to work on your paper at night. Sleep tight, we will cover your back. We offer all kinds of writing services.

Essays

Essay Writing Service

No matter what kind of academic paper you need and how urgent you need it, you are welcome to choose your academic level and the type of your paper at an affordable price. We take care of all your paper needs and give a 24/7 customer care support system.

Admissions

Admission Essays & Business Writing Help

An admission essay is an essay or other written statement by a candidate, often a potential student enrolling in a college, university, or graduate school. You can be rest assurred that through our service we will write the best admission essay for you.

Reviews

Editing Support

Our academic writers and editors make the necessary changes to your paper so that it is polished. We also format your document by correctly quoting the sources and creating reference lists in the formats APA, Harvard, MLA, Chicago / Turabian.

Reviews

Revision Support

If you think your paper could be improved, you can request a review. In this case, your paper will be checked by the writer or assigned to an editor. You can use this option as many times as you see fit. This is free because we want you to be completely satisfied with the service offered.