Wayback Machine Chrome extension now available

The Wayback Machine Chrome browser extension helps make the web more reliable by detecting dead web pages and offering to replay archived versions of them.  You can get it here.

For the past 20 years, the Internet Archive has recorded and preserved web pages, and hundreds of billions of them are available via the Wayback Machine. This matters because we are learning that the web is fragile and ephemeral. For example, a 2013 Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are now dead. Those decisions affect everyone in the U.S., and the evidence the opinions are based on is disappearing.

When previously valid URLs don’t respond, but instead return a result code of 404, we call that link rot. The Wayback Machine Chrome extension is designed to help mitigate link rot and other common web breakdowns.

With the “Wayback Machine” extension for Chrome, users are automatically offered the opportunity to view archived pages whenever any one of several error conditions, including code 404, or “page not found,” is encountered. If one of those codes is detected, the extension silently queries the Wayback Machine, in real time, to see if an archived version is available. If one is available, a notice is displayed via Chrome, offering the user the option to see the archived page.
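The lookup described above can be sketched in Python against the Wayback Machine's public availability endpoint (https://archive.org/wayback/available). This is only an illustration, not the extension's own code: the set of status codes treated as errors, the function names, and the hard-coded sample response are all assumptions, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumption: which status codes trigger a lookup; the post names only 404.
ERROR_CODES = {404, 408, 410, 451, 500, 502, 503, 504}

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the availability request URL for a page (hypothetical helper)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def archived_copy(status_code, api_response_text):
    """Return the closest snapshot URL, or None if there is nothing to offer."""
    if status_code not in ERROR_CODES:
        return None  # the page loaded normally, so no archive is offered
    snapshots = json.loads(api_response_text).get("archived_snapshots", {})
    closest = snapshots.get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical response body, shaped like the endpoint's JSON output.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(availability_query("http://example.com/missing"))
print(archived_copy(404, sample))
```

A real client would fetch the query URL over HTTPS and pass the response body to `archived_copy`; the sample string stands in for that step here.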

The Internet Archive considers the privacy of our users to be of critical importance. We try not to record IP addresses, and we have fought National Security Letters. You can rest assured that the use of the Wayback Machine Chrome extension will not expose your browsing history. In addition, we are in conversation with Google about adding a proxy server as an additional layer of protection.

Thank you for giving the Wayback Machine for Chrome extension a try.  We are committed to supporting better web browsing experiences and welcome your feedback and suggestions about how we can improve.  Please send us your bug reports, feature requests and other feedback directly to info@archive.org.

Posted in News | Leave a comment

Internet Archive’s Trump Archive launches today

The Trump Archive launches today with 700+ televised speeches, interviews, debates, and other news broadcasts related to President-elect Donald Trump, created using the Internet Archive’s TV News Archive.

A work in progress, the growing collection now includes more than 520 hours of Trump video. The earliest excerpt dates from December 2009, and the collection continues through the present. It includes more than 500 video statements fact checked by FactCheck.org, PolitiFact, and The Washington Post’s Fact Checker covering such controversial topics as immigration, Trump’s tax returns, Hillary Clinton’s emails, and health care.

Full list of fact checks with links to video statements in TV News Archive.
Note: We are working to update this spreadsheet with improved links. Stay tuned.

Visit the Trump Archive.

Reporters, researchers, Wikipedians, and the general public are invited to quote, compare and contrast televised statements made by Trump.

  • Use clips in your articles and videos.
  • Create supercuts on topics like Trump’s perspectives of the US press, made with our online “Popcorn” video editor.  
  • Let us know what content we are missing.  
  • If you have the technical resources, help us enhance search and discovery by collaborating in experiments to apply artificial intelligence-driven facial recognition, voice identification, and other video content analysis approaches.
  • How would you like to use such an archive? Comment below, or write to us at info@archive.org.

Why a Trump Archive?

We draw on the TV News Archive, and on our experience building the successful Political TV Ad Archive, to create a curated collection of material related to Trump, with an emphasis on fact-checked statements. The video is searchable, quotable, and shareable on social media.

In response to requests by our fact checking partners on the Political TV Ad Archive project and other media, we hope to provide assistance for those tracking Trump’s evolving statements on public policy issues.

For example: in July 2016, Trump told ABC’s George Stephanopoulos, “I have no relationship with Putin…I don’t think I’ve ever met him.” Stephanopoulos pressed him on this point during the interview, saying that Trump had previously claimed a relationship with him. PolitiFact ruled this statement by Trump as a “full flip flop”: “Trump’s denial of a relationship with Putin contradicted what he had said on multiple previous occasions.”

By providing a free and enduring source for TV news broadcasts of Trump’s statements, the Internet Archive hopes to make it more efficient for the media, researchers, and the public to track Trump’s statements while fact-checking and reporting on the new administration. The Trump Archive can also serve as a rich treasure trove of video material for any creative use: comedy, art, documentaries, wherever people’s inspiration takes them.

We consider the Trump Archive to be an experimental model for creating similar archives for other public officials. For example, we’ll explore the idea of creating curated collections for Trump’s nominees to head federal agencies; members of Congress of both parties (for example, perhaps the Senate and House majority and minority leadership); Supreme Court nominees; and so on.

While we’ve largely hand-curated this collection, we hope to collaborate with researchers to apply machine intelligence to expand it, build others, and make search of our entire TV library vastly more efficient.

Such experimentation builds on our experience with first prototyping and then developing the Political TV Ad Archive. Our first collection of political TV ads, covering ads aired in Philadelphia during the 2014 mid-term elections, was built largely by hand. However, in preparation for the Political TV Ad Archive, we created a new open source tool, the Duplitron, that was able to identify ad airings by deploying audio fingerprinting. During the course of the project, we collected nearly 3,000 ads and documented more than 364,000 ad airings.

Why now?

Just because something is broadcast or posted on the internet doesn’t mean it’s forever. Reporters and the public may take it for granted that a news story or a piece of broadcast video is only a Google search away, but as newspapers, companies, and organizations fail and change, vital information is often lost. The web is far more fragile than is generally understood.

The Internet Archive’s core mission is to preserve and make accessible our cultural heritage. For example, the Wayback Machine preserves websites over time, so if pages or sites are deleted, they can still be found. Rachel Maddow of MSNBC, for instance, reported on how the president-elect had deleted a web page from the official transition website that had touted Trump properties.

We also preserve political and news content through the TV News Archive, which contains news broadcasts by major networks back to 2009, searchable via closed captioning. The Political TV Ad Archive archives 2016 election ads along with relevant fact checks and follow-the-money reporting by our journalism partners. Our Political Campaign web archive is preserving election-related online media, such as select candidate and political groups’ websites and Twitter and Instagram feeds.

What’s next

The Trump Archive is a work in progress; we will continue to refine the content. We hope to work with others to broaden the materials available, to make search more efficient, and otherwise make it more useful for the public. We’d like your feedback and suggestions.

The great American author William Faulkner wrote, “The past is never dead. It’s not even past.” We believe that the Trump Archive, in preserving the past, can help the public engage more knowledgeably with our future.

Many thanks to the thoughtful contributions of Robin Chin, Jessica Clark, Katie Dahl, Katie Donnelly, John Gonzalez, Wendy Hanamura, Tracey Jaquith, Jeff Kaplan, Roger Macdonald, Ralf Muehlen, Craig Newmark, Sylvia Paull, Alexis Rossi, Dan Schultz, Nancy Watzman, our Partners & Funders and the Vanderbilt Television News Archive – on whose shoulders we stand.

Posted in Announcements, News | 35 Comments

Join us for a White House Social Media and Gov Data Hackathon!

Join us at the Internet Archive this Saturday January 7 for a government data hackathon! We are hosting an informal hackathon working with White House social media data, government web data, and data from election-related collections. We will provide more gov data than you can shake a script at! If you are interested in attending, please register using this form. The event will take place at our 300 Funston Avenue headquarters from 10am-5pm.

We have been working with the White House on their admirable project to provide public access to eight years of White House social media data for research and creative reuse. Read more on their efforts at this blog post. Copies of this data will be publicly accessible at archive.org. We have also been furiously archiving the federal government web as part of our collaborative End of Term Web Archive and have also collected a voluminous amount of media and web data as part of the 2016 election cycle. Data from these projects — and others — will be made publicly accessible for folks to analyze, study, and do fun, interesting things with.

At Saturday’s hackathon, we will give an overview of the datasets available, have short talks from affiliated projects and services, and point to tools and methods for analyzing the hackathon’s data. We plan for a loose, informal event. Some datasets that will be available for the event and publicly accessible online:

  • Obama Administration White House social media from 2009-current, including Twitter, Tumblr, Vine, Facebook, and (possibly) YouTube
  • Comprehensive web archive data of current White House websites: whitehouse.gov, petitions.whitehouse.gov, letsmove.gov and other .gov websites
  • The End of Term Web Archives, a large-scale collaborative effort to preserve the federal government web (.gov/.mil) at presidential transitions, including web data from 2008, 2012, and our current 2016 project
  • Special sub-collections of government data, such as every PowerPoint in the Internet Archive’s web archive from the .mil web domain
  • Extensive archives of social media data related to the 2016 election including data from candidates, pundits, and media
  • Full text transcripts of Trump candidate speeches
  • Python notebooks, cluster computing tools, and pointers to methods for playing with data at scale.

Much of this data was collected in partnership with other libraries and with the support of external funders. We thank, foremost, the current White House Office of Digital Strategy staff for their advocacy for open access and for working with us and others to make their social media open to the public. We also thank our End of Term Web Archive partners and related community efforts helping preserve the .gov web, as well as the funders that have supported many of the collecting and engineering efforts that make all this data publicly accessible, including the Institute of Museum and Library Services, Altiscale, the Knight Foundation, the Democracy Fund, the Kahle-Austin Foundation, and others.

Posted in Announcements, News | 7 Comments

Would Like to Archive Government Web Services, not just Web Sites– Please help

Archiving .gov and .mil websites is going on now, with lots of help, but what if we could archive full government web services? This would mean keeping interactive sites that include databases and forms available for future use even if the original website changes or is removed.

We like this idea because we would preserve how websites worked, not just what they looked like. As websites become more database driven and interactive, this would be a bigger help than the already helpful Wayback Machine.

We believe this is possible now given the increased use of virtual machines and cloud services. Webmasters are adjusting to having their systems work in an isolated environment, one that can be snapshotted.

What we need are some webmasters who would like to try this. We think that government websites would be perfect because they tend to change as administrations change and the datasets are often public data.

If you run a website and would like to participate in this experiment or would like to help on the receiving end, please send a note to info@archive.org or reply to this post.

Archiving web services could usher in a completely new age in archiving of Internet resources.

Posted in Announcements, News | 4 Comments

A Year-end Message from the TV News Archive

by Katie Donnelly

Over the past extremely unpredictable election year, the Internet Archive invented new methods and tools to give journalists, researchers, and the public the power to access, scrutinize, share, and thoroughly fact-check political ads, presidential debates, and TV news broadcasts.

Our efforts were designed to help citizens better understand the patterns of political messages designed to persuade them and find factual, reliable information in what is disturbingly being seen as a “post-truth” world.

The Political TV Ad Archive project proved to be highly useful to our high-profile fact-checking partners, as well as reporters at an array of outlets including The New York Times, The Washington Post, FOX News, The Economist, The Atlantic, and more. By providing data about when, where, and how many times political ads aired on TV in key markets, the project unlocked new creative potential for data reporters to analyze how campaigns and outside groups were targeting messages to voters in different locations.

Breaking events, like political debates and speeches, also offered a chance for archived TV content to shine, allowing reporters to isolate and share clips in near-real time, and fact-checkers to harvest dubious statements for further exploration. In addition, the project’s experience with developing audio fingerprinting (through a new invention we call the Duplitron) for identifying instances of ads inspired a new use: tracking candidate debate sound bites in subsequent TV news shows.

In this way, reporters and researchers were able to analyze and report on which political statements were trending across different TV programs and networks. This revealed the ideological, agenda-setting, and other editorial choices made by news producers about which issues to highlight and which to overlook.

As Roger Macdonald, director of the TV News Archive, wrote to project partners: “Citizens will increasingly hunger for sound information to inform wise electoral decisions. With our Republic being riven by increasing socio-political chaos and infectious divisions, whose magnitude has not been seen since before our Civil War, we think there are uncommon opportunities to serve citizens with the information for which they will increasingly yearn. We have an historic opportunity to thoughtfully place some grains of sand on the balance pan of reason.”

The project was supported by a generous grant from the Knight News Challenge, funded in partnership with the Knight Foundation, the Democracy Fund, the Hewlett Foundation and the Rita Allen Foundation, and received additional support from the Rita Allen Foundation, the Democracy Fund, PLCB Foundation, Craig Newmark, Christopher Buck, and others.

Here is a quick look at project accomplishments:

Political TV Ad Archive

  • Total number of archived ad views, most embedded in partner sites: 2,036,063
  • Number of ads collected: 2,991
  • Political ads broadcast 364,822 times over 26 markets
  • Number of fact and source checks: 131
  • Press coverage: 156 articles

Katie Donnelly is associate director at Dot Connectors Studio, a Philadelphia-based strategy firm that has worked with the Political TV Ad Archive.

Posted in News

New Research Tool for Visualizing Two Million Hours of Television News

Guest post by Kalev Leetaru

Today the Internet Archive announces a new interactive timeline visualization–the Television Explorer–that lets you trace how any keyword–think “emails”, “tax returns”, “alt-right”–has been covered on U.S. television news over the past half-decade.

See the Television Explorer, a new tool for exploring TV News.

Over the past year and a half, the GDELT Project and the Internet Archive’s Television News Archive have worked closely together to visualize how U.S. television news has covered the contentious 2016 political campaign.

One of the tools we created was the 2016 Candidate Television Tracker, which used closed captioning to count how many times each of the presidential candidates was mentioned on television and offered a day-by-day timeline showing the ebbs and flows of who was “winning” the free media wars. (Answer: President-elect Donald Trump.) This tool was used by such media outlets as The Atlantic, The Washington Post, FiveThirtyEight, Politico and The Guardian, among many others.

Now we are adapting this tool to allow more sophisticated searches: rather than just the presidential candidates, now you can trace television news coverage of any keyword of your choosing. You can even run advanced searches that find words in conjunction with other words or phrases, such as finding mentions of Hillary Clinton that also discuss her email server. All search results are available for download via CSV and JSON export, making it possible for data journalists, researchers, and advocates to fine-tune their analysis of the data.

When searching, you get back a visual timeline showing how often that word or phrase has appeared on American television news over the past half-decade. Nearly two million hours of television news totaling more than 5.7 billion words from over 150 distinct stations spanning July 2009 to present (though not all stations were monitored for the entire period) are searchable in this interface.

Unlike the Internet Archive’s Television News Archive interface, which returns results at the level of an hour or half-hour “show,” the interface here reaches inside of those six and a half years of programming and breaks the more than one million shows into individual sentences and counts how many of those sentences contain your keyword of interest. Instead of reporting that CNN had 24 hour-long shows yesterday that mentioned Donald Trump one or more times, the interface here will count how many sentences uttered on CNN yesterday mentioned his name–a vastly more accurate metric for assessing media attention.
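The difference between show-level and sentence-level counting can be illustrated with a short Python sketch. The caption data, the naive sentence splitter, and the function names below are assumptions made for illustration; they are not the Television Explorer's actual pipeline.

```python
import re
from collections import Counter

# Hypothetical caption data: (station, show title, transcript text).
shows = [
    ("CNN", "Morning Show", "Trump spoke today. The weather is mild. Trump will travel."),
    ("CNN", "Evening Show", "Markets fell. Analysts were surprised."),
    ("FOX", "Nightly", "Donald Trump held a rally. Crowds gathered."),
]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_mentions(shows, keyword):
    """Count, per station, shows and sentences that mention the keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    show_hits, sentence_hits = Counter(), Counter()
    for station, _title, text in shows:
        matching = [s for s in split_sentences(text) if pattern.search(s)]
        if matching:
            show_hits[station] += 1  # show-level: one hit per show
        sentence_hits[station] += len(matching)  # sentence-level: every mention
    return show_hits, sentence_hits


show_hits, sentence_hits = count_mentions(shows, "trump")
print(dict(show_hits))
print(dict(sentence_hits))
```

In this toy data CNN has one matching show but two matching sentences, which is the kind of distinction the sentence-level metric is meant to surface.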

Explore how CNN covered the presidential campaign of 2012 versus 2016 and understand just how big of a media event this year’s election really was. See precisely when Edward Snowden burst onto the scene and how Wikileaks got more coverage during the 2016 presidential election than during its debut in 2010. Watch the seasonal spikes of Thanksgiving, or see how Ebola received little attention, even as thousands died in Africa, becoming a topic only after the first Americans became infected.

Using the “near” search feature, plot coverage of Wikileaks that also mentioned either “Podesta,” “email,” or “emails” nearby and discover that FOX paid far more attention to the DNC and Podesta email hacks than CNN, MSNBC, CNBC or Bloomberg. In contrast, CNN focused more intensely on the Trayvon Martin shooting (Aljazeera America and Bloomberg were not yet being monitored by the Archive), while Aljazeera led coverage of the Michael Brown and Eric Garner deaths.

Search of term “Wikileaks” near Podesta, emails, Clinton
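A proximity search of the kind described above can be approximated in a few lines of Python. The word-based window, its size, and the function name are assumptions; the post does not say how the tool itself defines “near.”

```python
import re


def near_matches(text, keyword, companions, window=10):
    """Return word positions where keyword has a companion term within `window` words."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    companions = {c.lower() for c in companions}
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        if word != target:
            continue
        # Look at the words on either side of this occurrence.
        nearby = words[max(0, i - window): i + window + 1]
        if companions.intersection(nearby):
            hits.append(i)
    return hits


# Hypothetical caption text with one nearby and one isolated mention.
caption = ("Wikileaks released another batch of Podesta emails today. "
           "Later in the hour, a long unrelated report on the weather, sports, traffic "
           "and the markets ran before a brief mention of Wikileaks.")

print(near_matches(caption, "Wikileaks", ["Podesta", "email", "emails"], window=5))
```

Only the first mention of “Wikileaks” counts here, because the second has none of the companion terms within five words.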

Search for “ivory” to see that Aljazeera America (which ceased operation in April 2016) devoted vastly more of its coverage to elephant poaching in Africa than any other monitored national network. It also paid the most attention to “Africa” and to the “refugee” crisis. On the other hand, Bloomberg has devoted much more of its time to “China” and to the economic crisis in “Greece” last year.

We look forward to seeing what people do with this new tool. Please share your favorite searches on Twitter with the hashtag “#internetarchivetvsearch”. If you have any questions, please email kalev.leetaru5@gmail.com or nancyw@archive.org.

Kalev Leetaru is an independent data journalist. 

Posted in Announcements, News | 3 Comments

Robots.txt Files and Archiving .gov and .mil Websites


The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. Some have asked if we ignore URL exclusions expressed in robots.txt files.

The answer is a bit complicated.  Historically, sometimes yes and sometimes no; but going forward the answer is “even less so.”

Robots.txt files live on the top level of a website at a URL like this: https://example.com/robots.txt. This standard was developed in 1994 to guide search engine crawlers in a variety of ways, including some areas to avoid crawling. This standard is used by Google, for instance.
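As an illustration of how a crawler reads such a file, here is a minimal sketch using Python's standard `urllib.robotparser` module. The file contents and the crawler names (`examplebot`, `otherbot`) are hypothetical, and this is not a description of the Internet Archive's own crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of https://example.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text; no network access

# A crawler with no rule of its own falls back to the "*" rule.
print(parser.can_fetch("otherbot", "https://example.com/index.html"))
print(parser.can_fetch("otherbot", "https://example.com/private/report.html"))

# A crawler named explicitly in the file is excluded from the whole site.
print(parser.can_fetch("examplebot", "https://example.com/index.html"))
```

A crawler that honors the file would skip any URL for which `can_fetch` returns `False`; one that ignores exclusions would simply not consult it.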

These files were useful 20 years ago for the Internet Archive’s crawlers, but have become less and less so over the years because many sites have not actively maintained the files from the point of view of archiving. Also, large websites or hosted websites often do not make it easy for their users to edit these files, and large websites increasingly guide or block crawlers with technological measures. Another problem is knowing when a domain name has changed hands, since a current robots.txt file may not be relevant to an earlier era. Over time, we have come to encourage webmasters who want to exclude their sites to send exclusion requests to info@archive.org and to specify the time period the request applies to.

Our end-of-term crawls of .gov and .mil websites in 2008, 2012, and 2016 have ignored exclusion directives in robots.txt in order to get more complete snapshots. Other crawls done by the Internet Archive and other entities have had different policies.  We have had little or no negative feedback on this, and little or no positive feedback — in fact little feedback at all. The Wayback Machine has also been replaying the captured .gov and .mil webpages for some time in the beta wayback, regardless of robots.txt.   

Overall, we hope to capture government and military websites well, and hope to keep this valuable information available to users in the future.

Posted in News, Wayback Machine, Web Archive | 3 Comments

Preserving U.S. Government Websites and Data as the Obama Term Ends

Long before the 2016 presidential election cycle, librarians understood this often-overlooked fact: vast amounts of government data and digital information are at risk of vanishing when a presidential term ends and administrations change. For example, 83% of .gov PDFs disappeared between 2008 and 2012.

That is why the Internet Archive, along with partners from the Library of Congress, University of North Texas, George Washington University, Stanford University, California Digital Library, and other public and private libraries, is hard at work on the End of Term Web Archive, a wide-ranging effort to preserve the entirety of the federal government web presence, especially the .gov and .mil domains, along with federal websites on other domains and official government social media accounts.

While it is not the only project the Internet Archive is undertaking to preserve government websites, FTP sites, and databases at this time, the End of Term Web Archive is a far-reaching one.

The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. The effort is likely to preserve hundreds of millions of individual government webpages and data and could end up totaling well over 100 terabytes of archived materials. Over its full history of web archiving, the Internet Archive has preserved over 3.5 billion URLs from the .gov domain, including over 45 million PDFs.

This end-of-term collection builds on similar initiatives in 2008 and 2012 by original partners Internet Archive, Library of Congress, University of North Texas, and California Digital Library to document the “gov web,” which has no mandated, domain-wide single custodian. For instance, here is the National Institute of Literacy (NIFL) website in 2008. The domain went offline in 2011. Similarly, the Sustainable Development Indicators (SDI) site was later taken down. Other websites, such as invasivespecies.gov, were later folded into larger agency domains. Every web page archived is accessible through the Wayback Machine, and past and current End of Term specific collections are full-text searchable through the main End of Term portal. We have also worked with additional partners to provide access to the full data for use in data-mining research and projects.

The project has received considerable press attention this year, with related stories in The New York Times, Politico, The Washington Post, Library Journal, Motherboard, and others.

“No single government entity is responsible for archiving the entire federal government’s web presence,” explained Jefferson Bailey, the Internet Archive’s Director of Web Archiving.  “Web data is already highly ephemeral and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly-funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”

This year has also seen a significant increase in citizen- and librarian-driven “hackathons” and “nomination-a-thons” where subject experts and concerned information professionals crowdsource lists of high-value or endangered websites for the End of Term archiving partners to crawl. Librarian groups in New York City are holding nomination events to make sure important sites are preserved. And universities such as The University of Toronto are holding events for “guerrilla archiving” focused specifically on preserving climate-related data.

We need your help too! You can use the End of Term Nomination Tool to nominate any .gov or government website or social media site and it will be archived by the project team.   If you have other ideas, please comment here or send ideas to info@archive.org.   And you can also help by donating to the Internet Archive to help our continued mission to provide “Universal Access to All Knowledge.”

Posted in Announcements, News | 12 Comments

Internet Archive Canada and National Security Letter in the news: roundup

The Internet Archive garnered major media attention over the past week: first, for our plan to create a Canadian copy, and second, for the news that we received a National Security Letter (NSL) requesting personal information about a user, the second in our history.

Canadian copy

Brewster Kahle’s post explaining why, in light of the new administration, the Internet Archive is raising money to build a copy of its collections in Canada hit a nerve.  More details were in a FAQ.

On November 29, Rachel Maddow led her MSNBC show with a segment about how the Internet Archive’s Wayback Machine helps reporters by preserving a record of what politicians say online, even when they later delete it.

One of her main examples: how soon after winning the election, President-elect Donald Trump’s official federal transition web page included a “rundown ….of all of the ‘world’s top properties that Donald Trump’s owns.”

The website has since been deleted, Maddow noted.

Maddow also called the Internet Archive a “national treasure…an international treasure.” (We’re blushing.)

Meanwhile, Paul Sawers noted in Venture Beat:

 Given that lies and fake news played a crucial part in the 2016 U.S. presidential election narrative, it is somewhat notable that the Internet Archive had launched the Political TV Ad Archive back in January to help journalists fact-check claims made during political campaigning.

In The Washington Times, Andrew Blake wrote about the Internet Archive’s plans to create a Canadian copy and also reported:

Mr. Trump’s office did not immediately respond to a request for comment Wednesday. Prior to being elected president, however, the Republican businessman suggested taking action to prevent Americans from becoming radicalized online by the Islamic State terror group’s social media recruitment efforts.

Here’s a link to Trump’s speech referenced by The Washington Times.

Sam Thielman reported in The Guardian on challenges facing libraries generally, including the Internet Archive’s decision to create a Canadian copy of data. The piece also discusses how the New York Public Library has changed its privacy policies to assure readers that it will not keep user data longer than expected.

Other media outlets reporting on the Internet Archive’s news include NBC News, the BBC, the New Republic, Recode Daily, and Newsweek.

Increasing transparency on National Security Letters

Last week the Internet Archive also revealed we received a National Security Letter (NSL), requesting we turn over personal information about a particular user, the second in our history. We worked with the Electronic Frontier Foundation (EFF) to challenge the letter and gain the right to release it in redacted form; in the process, we also highlighted an error in the NSL about the right to appeal, which may have affected thousands of other letters.

Kim Zetter, a reporter for The Intercept, reported at length about how the Internet Archive took the unusual step of challenging the NSL–and won:

Now, Kahle and the archive are notching another victory, one that underlines the progress their original fight helped set in motion. The archive, a nonprofit online library, has disclosed that it received another NSL in August, its first since the one it received and fought in 2007. Once again it pushed back, but this time events unfolded differently: The archive was able to challenge the NSL and gag order directly in a letter to the FBI, rather than through a secretive lawsuit. In November, the bureau again backed down and, without a protracted battle, has now allowed the archive to publish the NSL in redacted form.

Dhrumil Mehta of FiveThirtyEight.com reported on the error exposed by the Internet Archive and the EFF–namely, the NSL incorrectly described the means for possible appeals of the gag order preventing an organization that has received such a letter from publicizing it. Mehta has filed a Freedom of Information Act request (FOIA) to find out how many letters sent out by the Federal Bureau of Investigation (FBI) contain this error:

This letter was particularly troublesome to privacy advocates because it contained misinformation about the rights of a letter recipient to challenge the nondisclosure requirement. The letter stated that the Internet Archive could “make an annual challenge to the nondisclosure requirement.” The Electronic Frontier Foundation, an advocacy organization that is legally representing the Internet Archive, pointed out in a press release that the passage of the USA Freedom Act in June of 2015 changed the law to allow letter recipients to challenge the National Security Letter at any time, not just once annually. In response to the EFF’s claim, the FBI withdrew its National Security Letter, allowed the Internet Archive to publish a redacted version of the letter containing the error and promised to correct the mistake by informing everyone else who got the same erroneous language.

It’s not just us

Tim Johnson of McClatchyDC drew all the themes together, linking the Internet Archive’s Canada announcement, the news on the NSL, and actions other library organizations are taking, all in one piece.

It turns out the nonprofit Internet Archive isn’t alone in taking action.

The New York Public Library announced a change this week to its privacy policy, informing users that it would retain less information about their activities.

The American Library Association, headquartered in Chicago, embraced that move and encourages others, including telling public libraries to encrypt all communications and lock up stored data to protect it from a prying government.

Posted in Announcements, News | 15 Comments

FAQs about the Internet Archive Canada

Responses from Brewster Kahle, Founder & Digital Librarian of the Internet Archive

Our letter mentioned that we are raising money to make a copy of the Internet Archive’s digital collections in Canada. In response, the press and others have asked a number of good questions. Here is a compendium of our answers:

Q. Were you working on a back-up before the election of Trump?
Yes, we have a partial copy of the Internet Archive in Alexandria, Egypt, and in Amsterdam, the Netherlands.

And also before the election we had been planning with the University of Toronto and University of Alberta to host the materials digitized from Canadian libraries at the Internet Archive Canada, which is a completely separate nonprofit from ours.

The statements by Trump on the campaign trail (see below) have pushed us into higher gear, moving us further and faster than we otherwise would have gone. The election led us to think bigger.

Q. Was there anything specific about Trump’s win that made you want to step up your game in terms of a backup archive? What in particular concerns you about what he has said/done? What potential risks do you see?
Upon his election, we looked through our archive to find out what his stance on Internet policy might be, and found several statements.

At this point, I think it would be prudent to take President-elect Trump at his word. Here are some of his statements, preserved in our Television News Archive. https://archive.org/tv

CNN Republican Presidential Debate
CNN December 15, 2015
Wolf Blitzer: Mr. Trump, are you open to closing parts of the internet?
Donald Trump: I would certainly be open to closing areas where we are at war with somebody. I sure as hell don’t want to let people that want to kill us and kill our nation use our internet. Yes, sir, I am.

https://archive.org/details/CSPAN_20151208_063000_Key_Capitol_Hill_Hearings
Donald Trump quote at a campaign rally at the USS Yorktown in South Carolina CSPAN broadcast speech on December 8, 2015
Donald Trump: So the press has to be responsible. They’re not being responsible, because we are losing a lot of people because of the internet. We have to do something. We have to go see Bill Gates and a lot of different people that really understand what is happening. We have to talk to them, maybe in certain areas, closing that internet up in some way. Some of you will say, “Oh, freedom of speech, freedom of speech.” These are foolish people. We have a lot of foolish people. We have a lot of foolish people. We have got to maybe do something with the internet because they are recruiting by the thousands.

Donald Trump on freedom of the press:
https://archive.org/details/R_macdonald-trumpOnPressV6

Q. How does this work? What goes into creating a backup of this magnitude (in whatever brief lay terms you can condense it to)?
There are stages we can take to achieve our overall goal. The first stage would be done with the University of Toronto and University of Alberta: to make a copy of what has been digitized from these Canadian collections (books and microfilm) and move that onto their university servers.

The next stage is to create a partial mirror at the Internet Archive Canada, which we have been planning to do.

Then the next stage is to create a “backup copy” in Canada for researchers. The best case scenario would be to have an active organization running a live copy of as much of the Internet Archive’s collections as makes sense. This is what we would like to do.

Q. Is there a specific dollar amount that you are aiming for?
To build a running archive in Canada will cost approximately $5 million, which is our goal. But we can take steps in this direction with less. After that, there will be ongoing support costs.

Q. How will you raise the money?
Great question. We are asking for donations from our users and supporters. Donations to the Internet Archive are tax-deductible in the US and can be made at https://archive.org/donate/

Q. What is the Internet Archive Canada? Can I make a donation to it?
The Internet Archive Canada is a Not-For-Profit Corporation, registered under number 435509-1. It has been running for years and employs 11 book scanners in Toronto and Alberta. It is not a registered public charity, and donations are tax-deductible on donors’ US income only. To donate, please send cheques to:

Internet Archive Canada
130 St. George St.
Suite 7001
Toronto, ON M5V 3T5
CANADA

Q. What does it mean when you say you archive the “Internet”? Is this national, or is it a global endeavor?
The Internet Archive archives many things, including books, music, video, web pages, and television, and makes these materials available for free on the archive.org, openlibrary.org, and archive-it.org sites. Take, for instance, the scope of our Web archiving in the Wayback Machine: https://archive.org/web. It houses a massive archive of over 250 billion web pages, made up of many collections. The Wayback Machine is freely accessible to anyone, and it is used by hundreds of thousands of people every day. It is a global project to archive these pages.

Q. What else does the Internet Archive preserve, beyond the Wayback Machine?
The Internet Archive is a non-profit digital library founded by Brewster Kahle in 1996 with the mission to provide “Universal access to all Knowledge.” The organization seeks to preserve the world’s cultural heritage and to provide open access to our shared knowledge in the digital era, supporting the work of historians, scholars, journalists, students, the blind and reading disabled, as well as the general public. The Internet Archive’s digital collections include more than 26 petabytes of data: 279 billion web pages, moving images (2.2 million films and videos), audio (2.5 million recordings, 140,000 live concerts), texts (8 million texts including 3 million digital books), software (100,000 items) and television (3 million hours). Each day, 2-3 million visitors use or contribute to the Internet Archive, making it one of the world’s top 250 sites. It has created new models for digital conservation by forging alliances with more than 450 libraries, universities and national archives around the world.

Posted in News | 30 Comments

Help Us Keep the Archive Free, Accessible, and Reader Private

The history of libraries is one of loss. The Library of Alexandria is best known for its disappearance.

Libraries like ours are susceptible to different fault lines:

Earthquakes,

Legal regimes,

Institutional failure.

So this year, we have set a new goal: to create a copy of Internet Archive’s digital collections in another country. We are building the Internet Archive of Canada because, to quote our friends at LOCKSS, “lots of copies keep stuff safe.” This project will cost millions. So this is the one time of the year I will ask you: please make a tax-deductible donation to help make sure the Internet Archive lasts forever. (FAQ on this effort).

On November 9th in America, we woke up to a new administration promising radical change. It was a firm reminder that institutions like ours, built for the long-term, need to design for change.

For us, it means keeping our cultural materials safe, private and perpetually accessible. It means preparing for a Web that may face greater restrictions.

It means serving patrons in a world in which government surveillance is not going away; indeed it looks like it will increase.

Throughout history, libraries have fought against terrible violations of privacy—where people have been rounded up simply for what they read.  At the Internet Archive, we are fighting to protect our readers’ privacy in the digital world.

We can do this because we are independent, thanks to broad support from many of you. The Internet Archive is a non-profit library built on trust. Our mission: to give everyone access to all knowledge, forever. For free. The Internet Archive has only 150 staff but runs one of the top-250 websites in the world. Reader privacy is very important to us, so we don’t accept ads that track your behavior.  We don’t even collect your IP address. But we still need to pay for the increasing costs of servers, staff and rent.

You may not know this, but your support for the Internet Archive makes more than 3 million e-books available for free to millions of Open Library patrons around the world.

Your support has fueled the work of journalists who used our Political TV Ad Archive in their fact-checking of candidates’ claims.

It keeps the Wayback Machine going, saving 300 million Web pages each week, so no one will ever be able to change the past just because there is no digital record of it. The Web needs a memory, the ability to look back.

If you find our work has been useful to you, please take a minute to donate whatever you can afford today. Help ensure the Internet Archive lasts forever. I promise you, it will be money well spent.

Posted in Announcements | 263 Comments

National Security Letter to Us from FBI Includes Error – How Many Like it were Sent to Others?

The Internet Archive, with the help of the Electronic Frontier Foundation (EFF), is making public the second National Security Letter (NSL) issued to the Archive in our history (we received our first NSL in 2007 and successfully contested it with help from EFF and the ACLU). In response to our challenge to this new NSL, the FBI has agreed to correct its standard NSL template and send clarifications about the law to potentially thousands of communications providers who have received NSLs in the last year and a half.

NSLs are a controversial tool that the FBI uses to demand specific types of private account information from service providers without a judge’s prior approval. NSLs also come with a gag order on the recipient. Their constitutionality is currently being litigated in courts.

The NSL we received includes incorrect and outdated information regarding the options available to a recipient of an NSL to challenge its gag. Specifically, the NSL states that such a challenge can only be issued once a year. But in 2015, Congress did away with that annual limitation and made it easier to challenge gag orders. The FBI has confirmed that the error was part of a standard NSL template and other providers received NSLs with the same significant error. We don’t know how many, but it is possibly in the thousands (according to the FBI, they sent out around 13,000 NSLs last year). How many recipients might have delayed or even been deterred from issuing challenges due to this error? Thankfully, the FBI says that they will now be issuing corrections regarding the law. You can see their letter to us here.

Publishing this NSL is also important because only a few have ever been made public due to their across-the-board gag restriction, in spite of the fact that hundreds of thousands of NSLs have been issued since 2001.

Information regarding the individual targeted by this NSL and the issuing office is redacted in the version that we are releasing. We didn’t find any documents in our records responsive to the NSL, so nothing was turned over.

We are deeply appreciative of the assistance of EFF in this matter, enabling us to make public an example of a mostly obscured practice with very significant implications for individual privacy and civil liberties. See EFF’s press release as well as their excellent collection of blog posts for more background and analysis.

Posted in Announcements, News | 11 Comments

The Internet Arcade becomes an Archive Reality

A couple years back, we introduced the Internet Arcade, which enabled people around the world to play a number of Arcade titles from the last 40 years in their browsers, instantly. We’ve also had collections of console games, and a general library of tens of thousands of software programs which has also proven very popular.

The work continues to expand the emulated systems and refresh what titles are available, but a project we’ve had going on the side for a while just came to fruition.

Among the organizations that turned out to benefit from having our browser-based emulations was X-Arcade, manufacturers of high-quality joysticks and control panels for use with computers and software. These controllers are meant to have the original arcade feel; a few examples were gifted to the Archive, and we’ve used them pretty extensively in demonstration days and special events.

Last year, X-Arcade announced an old-school full-sized arcade machine case for sale, and generously offered to send one to the Archive as well. We contacted an excellent artist, Mar Williams of Sudux.com, who has created art for the DEFCON hacking conference and many other events, and she put together custom Internet Archive-themed arcade side art for the machine. Here’s what she came up with:

[Image: mockup of the Internet Archive arcade side art]

The machine has made its way through shipping and moving companies and arrived at the Internet Archive’s 300 Funston Avenue headquarters in great shape, along with all the electronics and parts to make it go soon.

It’s one thing to see a mockup, and another to see the actual machine in your lobby:

[Image: the arcade machine in the Internet Archive lobby]

Over the next few weeks, the system will be set up to run with the Internet Archive systems and provide a really nice demonstration station for the many guests and visitors we see. It really jazzes up the place!

In the meantime, we’re now providing you with links to download the artwork files, in case you want to use them yourself.

Thanks again to X-Arcade for the lovely addition to our lobby, and to Mar Williams for such fantastic art!

Posted in Announcements, Cool items, Emulation | 9 Comments

Please: Help Build the 2016 U.S. Presidential Election Web Archive

Help us build a web archive documenting reactions to the 2016 Presidential Election. You can submit websites and other online materials, and provide relevant descriptive information, via this simple submission form. We will archive and provide ongoing access to these materials as part of the Internet Archive Global Events collection.

Since its beginning, the Internet Archive has worked with a global partner community of cultural heritage institutions, researchers and scholars, and citizens to build crowdsourced topical web archives that preserve primary sources documenting significant global events. Past collections include the Occupy Movement, the 2013 US Government Shutdown, the Jasmine Revolution in Tunisia, and the Charlie Hebdo attacks. These collections leverage the power of individual curators and motivated citizens to help expand our collective efforts to diversify and augment the historical record. Any webpages, sites, or other online resources about the 2016 Presidential Election are in scope. This web archive will build upon our affiliated efforts, such as the Political TV Ad Archive, and other collecting strategies, to provide permanent access to current political events.

As we noted in a recent blog post, the Internet Archive is “well positioned, with our mission of Universal Access to All Knowledge, to help inform the public in turbulent times, to demonstrate the power in sharing and openness.” You can help us in this mission by submitting websites that preserve the online record of this unique historical moment.

Posted in Announcements, Archive-It, News, Web Archive | 8 Comments

US Election Results

I am a bit shell-shocked; I did not think the election would go the way it did. I want to reassure everyone: we are safe. Our funding, mission, and partners have no reason to change. I find this reassuring; hopefully you do as well.

As we take the next weeks to have this sink in, I believe we will come to find we will have new responsibilities, increased roles to play, in keeping the world an open and free environment.

We are well positioned, with our mission of Universal Access to All Knowledge, to help inform the public in turbulent times, to demonstrate the power in sharing and openness.

I look forward to working with our staff, our partners, and the new partners that this creates, to see what our role should be to build the best damn library we can to serve the Maximum Public Good.

Over the next couple of weeks, please think through what we might do.  Looking forward to your ideas.

yours,

Brewster Kahle
Digital Librarian
brewster@archive.org

Posted in Announcements | 9 Comments

Aaron Swartz Weekend

by Lisa Rein, Cofounder and Coordinator, Aaron Swartz Day

In memory of Aaron Swartz, whose social, technical, and political insights still touch us daily, Lisa Rein, in partnership with the Internet Archive, will be hosting a weekend of events on Saturday, November 5 and Sunday, November 6. Friends, collaborators, and hackers can participate in a two-day Hackathon and Aaron Swartz Day Evening Reception.

Schedule of events held at the Internet Archive:

Saturday, November 5, from 10 am – 6 pm and Sunday, November 6, from 11am – 5pm — Participate in the Hackathon, which will focus on SecureDrop, the whistleblower submission system originally created by Aaron just before he passed away.

Saturday night, November 5th, from 6:30pm – 9:30pm — Celebrate and remember Aaron, and also the grand tradition of working hard to make the world a better place, at the Aaron Swartz Day Evening Celebration:

Reception: 6:30pm – 7:30pm – Come mingle with the speakers and enjoy nectar, wine & tasty nibbles.

Migrate your way upstairs: 7:30-8:00pm – We decided to give folks a little window of time to finish up their nibbles and wine at the reception, exchange contact info, and make their way upstairs to grab a seat to watch the speakers, which will begin promptly at 8pm.

Speakers 8:00pm – 9:30pm:

A Special Statement from Chelsea Manning (in celebration of this year’s Aaron Swartz Day and International Hackathon)

Tiffiniy Cheng (Co-founder and Co-director Fight for the Future)

Cindy Cohn (Executive Director, Electronic Frontier Foundation)

Shari Steele (Executive Director, Tor Project)

Yan Zhu (Security Expert, Friend of Chelsea Manning)

Alison Macrina (Founder and Executive Director, Library Freedom Project)

Conor Schaefer (DevOps Engineer, SecureDrop)

Brewster Kahle (Digital Librarian, Internet Archive) w/Vinay Goel  (Senior Data Engineer, Internet Archive)

Please RSVP to this event

For more information, contact:
lisa@lisarein.com
http://www.aaronswartzday.org

Posted in Event | 1 Comment

Election Night at the Internet Archive

The Internet Archive is informally open to our employees, their families and friends, and our community to watch the election results next Tuesday night. This is a spur-of-the-moment invitation and an experiment. If there are enough people interested, we will use the great room.

To cover the cost of pizza and soda, please purchase a $10 “ticket” on our Eventbrite.

The event will run from 6pm until the election is called — 11pm at the latest. We will limit the number of people and we reserve the right to ask anyone to leave for any reason.

If you are interested in volunteering to help that evening, please contact Salem at salem@archive.org.

Posted in News

GifCities: The GeoCities Animated GIF Search Engine

Try the Internet Archive’s animated GIF search engine at GifCities.org! You can now get your early-web GIF fix and have a fun way to browse the web archive. Search for snowglobes or butterflies or balloons or (naturally) cats. Clicking on a GIF takes you to the original page in the Wayback Machine. (Then please consider donating to the Archive.)

One of the goals for our 20th anniversary event last week was to highlight the amusing and wacky corners of the web, as represented in our web archive, in order to provide a light-hearted, novel perspective on the history of this amazing publication platform that we have worked to preserve over the years.

The animated GIF is perhaps the iconic, indomitable filetype of the early web. Meme-vessel, page-spacer, action-graphic-maker: GIFs are a quintessential feature of the 1990s web aesthetic, but remain just as popular today as they were twenty years ago. GeoCities, the first major web hosting platform for individual users to create their own pages, and once the third most visited site on the web before being shut down in 2009, occupies a similarly notable place in the history of the web.

So we combined these two aspects of web history by extracting every animated GIF from GeoCities in our web archive and building a search engine on top of them. Behold, for your viewing pleasure, over 4,500,000 animated GIFs (1,600,000 unique), searchable based on filename and URL path, with most GIFs linking to the archived GeoCities web page where they were originally displayed.
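To give a feel for what searching "based on filename and URL path" can mean, here is a minimal Python sketch. It is not the GifCities implementation: the function names, the tokenizing rule, and the example URLs are all hypothetical, chosen only to illustrate matching query terms against the path portion of a GIF's address.

```python
import re
from typing import List, Set
from urllib.parse import urlparse


def path_tokens(url: str) -> Set[str]:
    """Split the path part of a URL (directories plus filename) into lowercase tokens."""
    path = urlparse(url).path.lower()
    return {tok for tok in re.split(r"[^a-z0-9]+", path) if tok}


def search_gifs(urls: List[str], query: str) -> List[str]:
    """Return URLs whose path tokens contain every query term as a substring."""
    terms = [t for t in re.split(r"[^a-z0-9]+", query.lower()) if t]
    if not terms:
        return []
    results = []
    for url in urls:
        tokens = path_tokens(url)
        if all(any(term in tok for tok in tokens) for term in terms):
            results.append(url)
    return results


# Hypothetical example URLs, made up for illustration only.
urls = [
    "http://www.geocities.com/Heartland/1234/images/dancing_cat.gif",
    "http://www.geocities.com/SiliconValley/5678/butterfly2.gif",
    "http://www.geocities.com/Area51/42/cats/spin.gif",
]

for hit in search_gifs(urls, "cat"):
    print(hit)
```

In this sketch a query for "cat" matches both a filename (dancing_cat.gif) and a directory name (cats), which is the kind of result you would expect when only filenames and paths, not image contents, are indexed.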

Some random staff faves:

[Images: staff-favorite animated GIFs]

Soft-launched at our anniversary event on Wednesday, where we also projected GifCities on the side of our headquarters in San Francisco, the project has been featured in The Guardian, BoingBoing, the A.V. Club, CNET,  and others. The GeoCities GIF collection was also made available for creative reuse by artists and researchers, and featured in work such as the GifCollider project currently showing at BAMPFA (see the videos online) and the Hall of GIFs data visualization at NCSU. Shout-outs also go to others working with the GeoCities web archive, including the Geocities Research Institute and historians. More details on the project can be found at the GifCities about page.

And yes, like every other upstanding web citizen, we GifCities’ed ourselves.

Posted in Announcements, News | 3 Comments

Making the Web More Reliable — 20 Years and Counting

As a part of our 20th anniversary, here are some highlights of Internet Archive tools and projects that help make the web a more reliable infrastructure for supporting our culture and commerce.

All in all, the Internet Archive is building collections and tools to help make the open web a permanent resource for current users and into the future.

Please donate to make it even better.

Thank you to the hundreds of people who have worked for the Internet Archive over the past 20 years, and to the thousands who have supported the Archive and contributed to the collections.

Posted in Announcements, News | 3 Comments

Searching Through Everything

With over 20 million items in the Internet Archive’s many collections, having a good way to search through them to find exactly what you want is crucial. It is equally important to be able to filter the data in flexible ways so that you see subsets of the data most relevant to you. We are pleased to offer two new features that might change everything about how you search.

Faceted Filtering

Once you’ve executed a site search, either from the search form at the top right of every page or by going to the search page directly, you’ll see a bunch of new checkboxes down the left-hand side, in addition to the search results. These checkboxes are grouped into categories, such as “Media Type” and “Topics & Subjects”.

Clicking any of the checkboxes adds the corresponding term to the search criteria, allowing you to more precisely define the filtered set of search results. Checkmarking more than one term within the same category causes items that match any of the selected terms to be displayed, whereas checkmarking items from two different categories means that only items matching both terms will be shown. Play around with it, and you’ll see how intuitive it is. Checking or unchecking new terms causes search results to be re-filtered on the fly.
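The combination rule described above (any checked term within a category, all categories together) can be sketched in a few lines of Python. This is only an illustration of the logic, not the Archive's search code: the item records, the identifiers, and the function names are hypothetical.

```python
from typing import Dict, List, Set

Facets = Dict[str, Set[str]]


def matches(item_facets: Facets, selected: Facets) -> bool:
    """True if the item has at least one checked term in every category with checks."""
    return all(
        bool(item_facets.get(category, set()) & terms)
        for category, terms in selected.items()
        if terms  # a category with nothing checked does not filter
    )


def filter_items(items: List[dict], selected: Facets) -> List[dict]:
    """Keep only the items that satisfy the checked facets."""
    return [item for item in items if matches(item["facets"], selected)]


# Hypothetical items, made up for illustration only.
items = [
    {"identifier": "item-1",
     "facets": {"Media Type": {"texts"}, "Topics & Subjects": {"history"}}},
    {"identifier": "item-2",
     "facets": {"Media Type": {"audio"}, "Topics & Subjects": {"history", "music"}}},
    {"identifier": "item-3",
     "facets": {"Media Type": {"movies"}, "Topics & Subjects": {"music"}}},
]

# Two terms checked under "Media Type" (either may match) and one under
# "Topics & Subjects" (must also match).
selected = {"Media Type": {"texts", "audio"}, "Topics & Subjects": {"music"}}

print([item["identifier"] for item in filter_items(items, selected)])
```

Here only item-2 survives: it matches one of the two checked media types and also carries the checked topic, while item-1 fails the topic check and item-3 fails the media type check.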

We were looking for a way to provide a more powerful, visual approach to filtering search results. When we user-tested the faceted search interface, our testers loved it. It was a familiar interface already in use throughout the Internet which offered both simplicity and richness.

Full-Text Search (in Beta)

Every day, we see an average of 50,000 hits on our search pages, as you, our users, search for title, creator, and various other metadata about the items we’ve archived. But you have long asked when you would be able to search not only across all items but within them as well. For years you’ve been able to search within the text of a single book using our BookReader, but never before have you been able to search across and within all 9 million available text items at the Internet Archive in a single shot. Until now.

And here’s all you have to do: On the search page, after entering your search query in the text field, checkmark “Search full text of books” just underneath the text field, and then click or tap “GO”. That’s it! In seconds, you’ll have the results of searching through millions of texts. Note that the facets at the left work a little differently from non-full-text searches; just click or tap one to add it as a filter criterion.

At the moment, we’re still in beta. Suffice it to say, we’ve faced quite a number of challenges in configuring and populating our full-text search engine, from creating the Elasticsearch clusters to dealing with optical character recognition (OCR) issues related to strange fonts, running page headers, or language recognition. We are continuing to make improvements, and still have a ways to go.

But please use it! Try searching for some phrase that’s stuck in your head from a book long ago forgotten, and see what comes up. You now have the contents of 9 million texts at your fingertips.

Posted in Announcements, Books Archive, News | 9 Comments